Reading Feed

Articles I've read with my notes and highlights

Using Git with coding agents by Simon Willison
  • Git has a mechanism called the reflog which can often capture details of code that hasn’t been committed to a permanent branch. Agents can search that, and search other branches too.
  • When you run a bisect operation, you provide Git with a test condition and a start and end commit. Git then runs a binary search to identify the earliest commit for which your test condition fails.
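
The binary search behind bisect can be sketched in a few lines of Python; the commit list and `is_bad` predicate below are illustrative stand-ins for real commits and a real test script.

```python
# Sketch of the binary search behind `git bisect`; the commit list and
# is_bad predicate stand in for real commits and your test script.
def first_bad(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.
    Assumes commits[0] is known good and commits[-1] is known bad."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]

# Regression introduced at commit 6: found in O(log n) checks, not n
print(first_bad(list(range(10)), lambda c: c >= 6))  # → 6
```

In real use, `git bisect run <script>` plays the role of `is_bad`, with the script's exit code deciding good vs. bad.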
Announcing DuckDB 1.5.0 by The DuckDB team
  • DuckDB now natively supports the VARIANT type, inspired by Snowflake’s semi-structured VARIANT data type and available in Parquet since 2025. Unlike the JSON type, which is physically stored as text, VARIANT stores typed, binary data. Each row in a VARIANT column is self-contained with its own type information. This leads to better compression and query performance.
  • DuckDB also supports reading VARIANT types from Parquet files, including shredding (storing nested data as flat values).
I don’t know if my job will still exist in ten years
Balancing cost and reliability for Spark on Kubernetes by Justin Lee
  • On Kubernetes, we use Karpenter with EKS Auto Mode for node management. Using this setup, Spark jobs do not need to set instance types or cluster sizes. They state their CPU and memory requirements, and Karpenter picks the optimal capacity, starts nodes as needed, and removes them when they are no longer needed. This dynamic provisioning both simplified and optimized our node management.
  • In practice, our jobs using Spot Instances often failed.
  • Spot Balancer is a Kubernetes tool that manages how a Spark job’s executors are split between spot and on-demand capacity. This gives us more control over spot usage per job, not just across the whole cluster.
Introducing the Apache Iceberg File Format API - Apache Iceberg™
  • The File Format API introduces a unified, extensible layer that engines can rely on when reading and writing Iceberg data files in any supported format. It will ship in the upcoming Apache Iceberg 1.11.0 release, making it available to all engines that use the Iceberg Java readers and writers.
Specs Should Be Equations, Not Essays by Benoit Pimpaud
Anti-patterns: things to avoid by Simon Willison
  • Don’t file pull requests with code you haven’t reviewed yourself.
Agentic Search over Graphs of Long Documents (or LAD-RAG++) by Pierce Lamb
  • I had recently read the blog post The RAG Obituary, where the author argued that retrieving (or investigating) over long documents was better suited to a Claude Code approach: provide the raw document data in a file system and just give an agent some foundational tools to interact with that raw document data.
  • Luckily, LAD-RAG’s approach to inference was exactly this: process the document into this graph structure, then provide an “agent” a set of tools to retrieve/explore that graph system and let it decide how it wants to proceed. So LAD-RAG was ticking a lot of boxes:
    - Use the chunking mechanism the author intended via layout
    - Maintain semantic connections across pages
    - Provide this data via a set of tools to an agent to answer questions
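
A minimal Python sketch of that tool-over-graph setup; the graph shape, node ids, and tool functions here are hypothetical illustrations, not LAD-RAG's actual interface.

```python
# Hypothetical sketch of "agent tools over a document graph"; the graph
# shape, node ids, and tools are illustrative, not LAD-RAG's actual API.
doc_graph = {
    "sec1": {"text": "Revenue grew 12% in Q3.", "page": 1, "links": ["tbl1"]},
    "tbl1": {"text": "Q3 revenue table: 12% growth.", "page": 2, "links": ["sec1"]},
}

def get_node(node_id):
    """Tool: fetch a node's raw content (a layout-derived chunk)."""
    return doc_graph[node_id]["text"]

def neighbors(node_id):
    """Tool: follow semantic/layout links, even across pages."""
    return doc_graph[node_id]["links"]

def search(keyword):
    """Tool: lexical entry point into the graph."""
    return [nid for nid, n in doc_graph.items()
            if keyword.lower() in n["text"].lower()]

# The agent, not a fixed retrieval pipeline, decides how to traverse:
hits = search("revenue")
print(hits, [neighbors(h) for h in hits])  # → ['sec1', 'tbl1'] [['tbl1'], ['sec1']]
```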
Data Engineering After AI by Ananth Packkildurai
How I got Claude to teach me dbt
  • LLMs are rightly infamous for confidently asserting complete nonsense, and whilst it’s got a lot better in recent months, Claude is still not perfect, as I found when I challenged another aspect of its implementation ideas. But then… Claude saves itself by owning its error, and then going to check what the actual values of the field are for itself… nice!

  • Could I edit a file by hand by figuring out the ASCII byte values to write to disk with dd? Umm, I guess? Does that mean I don’t use a text editor? Of course not. It’s about understanding the abstraction, the capability of the tools, and making an active, conscious, and educated decision about how to use them.
  • The risks? Plenty. Getting distracted and taking Claude on a flight of fantasy that may be fun but ultimately a waste of time. Working with technology which is at the edges (or beyond) Claude’s training dataset. Not having enough context for the area and trusting blindly what Claude tells you.
Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest by Pinterest Engineering
Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17% by Steef-Jan Wiggers
  • Anthropic recently published a randomized controlled trial showing developers using AI coding assistance scored 17% lower on comprehension tests than those coding manually, with productivity gains failing to reach statistical significance.
  • I wonder if we’re going to have a future where the juniors never gain the skills and experience to work well by themselves, and instead become entirely reliant on AI.
  • AI is incredibly useful as a personal tutor.
  • AI can reduce task completion time by 80% for tasks where developers already have relevant skills.
AI “Vibe Coding” Threatens Open Source as Maintainers Face Crisis by Steef-Jan Wiggers
Linear walkthroughs by Simon Willison
Bruteforcing the Bitwarden master password I forgor
The AI Vampire by Steve Yegge
Introduction to PostgreSQL Indexes
  • Although the hash index gracefully handles hash collisions, it works best with an even distribution of hash values and is most suited to unique or mostly unique data.
  • Nodes in BRIN indexes store the minimum and maximum values of the range of values present in the page referred to by the index. This makes the index more compact and cache-friendly, but restricts its use cases.
  • A generalized inverted index (GIN) is appropriate when you want to search for an item within composite data, such as finding a word in a blob of text, an item in an array, or an object in a JSON document.
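
The BRIN note above can be illustrated with a toy Python sketch (not Postgres internals): summarize each block of rows by its (min, max), then prune any block whose range cannot contain the search value.

```python
# Toy sketch of the BRIN idea (not Postgres internals): keep only the
# (min, max) summary per block of rows, then skip blocks that cannot match.
def build_brin(values, pages_per_range=4):
    return [(min(values[i:i + pages_per_range]), max(values[i:i + pages_per_range]))
            for i in range(0, len(values), pages_per_range)]

def candidate_blocks(summary, needle):
    # Only blocks whose [min, max] range could contain the value get scanned
    return [i for i, (lo, hi) in enumerate(summary) if lo <= needle <= hi]

data = [10, 12, 11, 13, 40, 42, 41, 43, 90, 91, 92, 93]
summary = build_brin(data)            # [(10, 13), (40, 43), (90, 93)]
print(candidate_blocks(summary, 41))  # → [1]: two of three blocks skipped
```

The pruning only pays off when values are physically clustered on disk, as in the example data, which is exactly the restriction on BRIN's use cases the note mentions.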
Your agents need runbooks, not bigger context windows by Ben Lorica 罗瑞卡
  • Context File System (CFS). You might also hear this more broadly categorized as an Operational Skill Store. This architecture separates the expensive reasoning of a large language model from the actual storage of operational knowledge. It mirrors the way a mature engineering team works.
Lance table format explained simply
Context Management for Deep Agents by LangChain Accounts
  • Context compression refers to techniques that reduce the volume of information in an agent’s working memory while preserving the details relevant to completing the task.
  • Offloading large tool results: We offload large tool responses to the filesystem whenever they occur.
  • Offloading large tool inputs: When the context size crosses a threshold, we offload old write/edit arguments from tool calls to the filesystem.
  • Summarization: When the context size crosses the threshold, and there is no more context eligible for offloading, we perform a summarization step to compress the message history.
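
The "offloading large tool results" step might look something like this Python sketch; the threshold, message shape, and file naming are assumptions, not LangChain's implementation.

```python
# Hypothetical sketch of offloading oversized tool results: swap the
# bulky content out to a file and leave a short pointer in the history.
import tempfile
from pathlib import Path

THRESHOLD = 200  # max characters kept inline (illustrative value)

def offload_large_results(messages, workdir):
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and len(msg["content"]) > THRESHOLD:
            path = Path(workdir) / f"tool_result_{i}.txt"
            path.write_text(msg["content"])     # full result kept on disk
            msg = {**msg, "content": f"[result offloaded to {path.name}]"}
        out.append(msg)
    return out

msgs = [
    {"role": "tool", "content": "x" * 500},          # too big to keep inline
    {"role": "assistant", "content": "short reply"},  # untouched
]
with tempfile.TemporaryDirectory() as d:
    compact = offload_large_results(msgs, d)
print(compact[0]["content"])  # → [result offloaded to tool_result_0.txt]
```

The agent can still recover the full result later with a file-read tool, which is what makes this compression lossless in practice.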
Inside OpenAI’s in-house data agent
Performance Tips Using Postgres and pgvector | Crunchy Data Blog
  • Have enough RAM to build new indexes. Building indexes with larger lists requires higher settings for maintenance_work_mem — if you do not have enough memory you’ll get an error. When building the lists = 2000 index above, maintenance_work_mem required 1.3 GB of RAM.
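
As a sketch of where that setting applies (the table and column names are hypothetical; the session-level SET and the ivfflat lists syntax are standard Postgres/pgvector):

```sql
-- Give the index build enough memory before creating a large ivfflat index
SET maintenance_work_mem = '2GB';
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 2000);
```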
Demystifying evals for AI agents
  • The agent shouldn’t be able to easily “cheat” the eval. Tasks and graders should be designed so that passing genuinely requires solving the problem rather than exploiting unintended loopholes.
  • Like the Swiss Cheese Model from safety engineering, no single evaluation layer catches every issue. With multiple methods combined, failures that slip through one layer are caught by another.
  • The patterns vary by agent type, but the fundamentals described here are constant. Start early and don’t wait for the perfect suite. Source realistic tasks from the failures you see. Define unambiguous, robust success criteria. Design graders thoughtfully and combine multiple types. Make sure the problems are hard enough for the model. Iterate on the evaluations to improve their signal-to-noise ratio. Read the transcripts!
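
The Swiss-cheese layering can be sketched as combining independent graders, each an imperfect check on its own; the grader names and rules here are illustrative, not from the article.

```python
# Illustrative layered grading: a transcript passes only if every
# grader passes, so a loophole must slip through all layers at once.
def grade(transcript, graders):
    results = {name: g(transcript) for name, g in graders.items()}
    return all(results.values()), results

graders = {
    "has_answer": lambda t: "ANSWER:" in t,                   # did it answer at all?
    "no_cheating": lambda t: "cat expected_output" not in t,  # crude loophole check
}

print(grade("ANSWER: 42", graders))
# → (True, {'has_answer': True, 'no_cheating': True})
```

Keeping the per-grader results, not just the boolean, is what makes transcript reading and iteration on signal-to-noise practical.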
Deep Blue by Simon Willison
  • I’ve even faced accusations from my peers that I am actively harming their future careers through my work helping people understand how well AI-assisted programming can work.
A Single Reason To Not Vibe Code
  • The risk of cognitive-skill atrophy amongst vibe coders is something that IMHO should be looked at more closely.