The Medallion Architecture Farce.

I can no longer hold the boiling and frothing mess of righteous anger that starts to rumble up from within me when I hear the words “Medallion Architecture” in the context of Data Modeling, especially when it’s used by some young Engineer who doesn’t know any better. Poor saps who have been born into a Databricks world where that fresh, supple mind has been polluted and twisted by the machinations of a marketing department.

Look, I am a daily user of Databricks; I have no axe to grind with them in particular. But the false gospel of the “Medallion Architecture” has wreaked havoc on a generation of Data Engineers.

Most of the time, our SaaS overlords can be trusted to produce and distribute whatever new technology they are best at. The problem arises as they grow in size and appetite to consume every bit and byte, in a never-ending quest to claim the “end-to-end” platform. Marketing gets involved with DevOps and Sales Engineers, and chaos ensues.

You should never forget that you must hold these SaaS vendors at a skeptical arm’s length and take what they say with a grain of salt. Until you have steel-manned their propositions and proved what they say isn’t just to line their pockets, the souls of unwitting developers are at risk.

Think about it.

Call me an old curmudgeon, whatever. It’s very possible I’m an old dog that simply can’t learn new tricks; if that’s the case, bury me in the prairie. But one must look to the past and examine what has been accepted and used for decades, things that have worked in Production since before you were born. This must be balanced with a healthy dose of … new things CAN be good and an improvement.

I shall propose to you that this so-called “Medallion Architecture” is NOTHING more than Marketing Speak.

The earliest public mention I can find from Databricks is June 2, 2020, in a blog post on monitoring audit logs that references “our medallion architecture.”

Shortly after, other Databricks posts in June and September 2020 also used the term and diagrammed the Bronze/Silver/Gold flow, confirming the concept was already in use by then.

So: first public reference → June 2, 2020 (Databricks blog).

Let’s be clear about something: Databricks says “OUR medallion architecture.” Keyword: our.

It’s all about Data Modeling.

The long and short of it is that medallion architecture IS an approach to Data Modeling. In this respect, maybe I should just leave Databricks alone, since Data Modeling is in the eye of the beholder. My problem with the whole thing is that we have lived with the Staging / Fact / Dimension / Data Mart nomenclature since the beginning of the Data Warehouse era.

It was, and is … really a TWO step program, sometimes THREE when called for, but 90% of the time a two-stepper.

RAW -> FACT/DIM

You load raw data … you transform that raw data into its final state and then write to fact or dimension.

Done.

Databricks’ medallion architecture is so confusing that people are still asking about it on Reddit. Asking the obvious question that everyone probably asks. More or less …

What the CRAP is the difference between SILVER and GOLD???

Proof that the marketing gurus at Databricks failed in their task of brainwashing people CORRECTLY can be seen in the most voted answer … “Gold layer should also be your data contract with your consumers.”

This is a non-answer; this is confusing. It has never been mentioned by Databricks inside their documentation around medallion architecture. Why? Databricks isn’t dumb enough to talk about Data Contracts in the midst of their data modeling ramblings. Heck, half the data world wants to stab anyone in the eye who mentions Data Contracts.

So, what the Reddit mob has surmised is that medallion architecture is directly linked to data contracts? Give me a break. Layering one snake oil on top of another.

Proof people do whatever they want and call it what they want.

In truth, what Databricks was calling the Gold layer was really a “Data Mart,” which was a further aggregation of analytics based on a Fact table. However, in the real world, you would have numerous fact tables that feed all the major reports and downstream consumers, with only a few consumers requiring further aggregation in the form of a Data Mart.

What happened was that the writers of medallion architecture at Databricks simply didn’t have the proper background and experience to understand these nuances. They sold the idea that you HAD to have three layers, Bronze, Silver, Gold, for every dataset. (You do realize that this required more storage and compute, which goes directly into the pockets of Databricks, right??)

You should model your data in your Lake House the same way it was modeled in the Data Warehouse. It works quite well, is not confusing, and is based not on marketing material but on about three decades of proven use.

  • load raw data,
  • transform raw data into Fact or Dimension,
  • create Data Marts of aggregation where needed.
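
For the skeptics, here is a minimal sketch of those three steps in plain Python. The order records, the column names, and the helper functions are all made up for illustration; a real Lake House would do this with Spark or SQL over actual tables, not in-memory lists. The point is only the shape: raw in, Fact and Dimension out, and a Data Mart only if somebody needs one.

```python
from collections import defaultdict

# Step 1: load raw data. These rows are hypothetical, shaped the way
# messy source-system records often arrive (strings, stray whitespace).
RAW_ORDERS = [
    {"order_id": "1", "customer": " Alice ", "country": "us", "amount": "20.00"},
    {"order_id": "2", "customer": "Bob", "country": "CA", "amount": "15.50"},
    {"order_id": "3", "customer": "alice", "country": "US", "amount": "4.50"},
]


def clean_name(raw_name):
    """Normalize a customer name so duplicates collapse to one value."""
    return raw_name.strip().title()


def build_dim_customer(raw_rows):
    """Step 2a: one dimension row per distinct customer, with a surrogate key."""
    dim = {}
    for row in raw_rows:
        name = clean_name(row["customer"])
        if name not in dim:
            dim[name] = {
                "customer_key": len(dim) + 1,
                "customer_name": name,
                "country": row["country"].upper(),
            }
    return dim


def build_fact_orders(raw_rows, dim_customer):
    """Step 2b: typed fact rows that reference the dimension by key."""
    facts = []
    for row in raw_rows:
        name = clean_name(row["customer"])
        facts.append({
            "order_id": int(row["order_id"]),
            "customer_key": dim_customer[name]["customer_key"],
            "amount": float(row["amount"]),
        })
    return facts


def build_mart_revenue_by_country(fact_orders, dim_customer):
    """Step 3 (optional): aggregate the fact table for one consumer's report."""
    country_by_key = {d["customer_key"]: d["country"] for d in dim_customer.values()}
    totals = defaultdict(float)
    for fact in fact_orders:
        totals[country_by_key[fact["customer_key"]]] += fact["amount"]
    return dict(totals)


dim_customer = build_dim_customer(RAW_ORDERS)
fact_orders = build_fact_orders(RAW_ORDERS, dim_customer)
mart = build_mart_revenue_by_country(fact_orders, dim_customer)
print(mart)
```

Notice that most consumers could read `fact_orders` and `dim_customer` directly. The mart function only exists because one hypothetical report wanted revenue rolled up by country.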

What, you think this IS medallion architecture?

Well, you hobbit, which came first, Kimball or Databricks? All Databricks managed to do was mislead a generation of developers and confuse them, which is front and center in this Reddit post.

I still can’t get over the fact that people are now conflating Data Contracts with the medallion architecture (data modeling). This is hilarious. They deserve each other and the misery each of them brings to the other.

What a concoction of marketing drivel forged in the heart of Mordor.

The best and most timeless solutions are simple.

If something seems confusing, can’t be explained the same way by multiple people, and requires three backflips and arguments about nomenclature and other tomfoolery, then you can assure yourself that perhaps you have found a wolf in sheep’s clothing.