28 Comments
User's avatar
Michał Poczwardowski's avatar

Can you EXPLAIN which part of the post took the longest to write? Just kidding

You put explain and indexing in the Pro section. I reckon that the target group is managers - for engineers, that is something we required really early on to make sure that they think about the performance.

Great article - Thanks for writing it!

Anton Zaides's avatar

😂

Honestly, I didn't meet a lot of engineers who new how to properly use EXPLAIN.

Michał Poczwardowski's avatar

So it sounds like it's worth EXPLAINing 🙂

Switching gears a little bit: SQL is a declarative language. I don't know why, but I think about it every time I see it in action. The concept of declaring what we want to get without needing to write procedures to get it is great. TBH It is close to what we are getting recently with LLMs 🤔

Anton Zaides's avatar

I agree! It's much more straightforward than anything else. In contrast to LLMs, you do need to 'work' a bit to get what you want from it, but it's like a logical puzzle with very clear (and limited!) set of rules.

Denis Baltor's avatar

What a great post!

Anton Zaides's avatar

Thank you Denis!

Junaid Effendi's avatar

I think SQL is important for everyone who deals with data, its a universal data language.

With so much data, one should know sql, although with chatgpt lot of things will be simplified.

Anton Zaides's avatar

I think it’s ok to use LLMs to simplify it, but you shouldn’t depend on it. You must understand what you are doing when you work with data.

Michael Pope's avatar

What was the most difficult part about learning SQL for you? I see the knowledge drop off in your graphic at subqueries and I can sympathize with that 🤣

Anton Zaides's avatar

I think for me the CTEs were the things I took the longest to learn - not really because they were hard, just was not familiar with them :)

Karthik Subramanian's avatar

As someone who did a lot of SQL + data pipelining work in a growth engineering context, this article is a great refresher!!

Very good running example to illustrate each point (eg: rank/window func, aggregations)

Anton Zaides's avatar

Thank you Karthik, glad it was useful! :)

Prajakta's avatar

Great article summarizing all of the SQL concepts!

Acquier's avatar

The first query can be done without subquery:

select deal_id,region, customer_name, MAX(deal_amount)

From deals d1

GROUP BY region;

Anton Zaides's avatar

But then you can’t select the deal data, just the amounts

Raviraj Achar's avatar

Technical depth aside, this was written amazingly well from a readability point of view. Loved the hook! You are a great writer 😀

I only did #6 when I worked on a windows app and we cared about sql performance (sqllite). Now sql is just a tool to do analysis these days.

Anton Zaides's avatar

Thanks Raviraj! The hook was rewritten like 10 times :) In the end this one didn't do so well in the open rate, but very well on reddit, once I tried 'Why most developers stop learning SQL at subqueries - a 5-minute guide for PARTITION BY and CTEs' 🤷‍♂️

Even as a just-analysis tool, it can be SUPER useful for answering complex questions

Kevin Naughton Jr.'s avatar

Thanks so much for the mention Anton! :)

Kyle Mechler's avatar

I really appreciated the article! I've used partitions in the past but it has been a while and this was a great refresher. Plus I pushed to learn about CUME_DIST() that you teased us with and I have an immediate use for it!

Anton Zaides's avatar

Glad you’ve caught the teasing :)

I’m curious to know about the use though 🙃

Engineer's Codex's avatar

Great article Anton!

Anton Zaides's avatar

Thanks Leo! ❤️

Akos Komuves's avatar

It's good to see more technical articles on Substack! Looks like I don't have to limit myself to the advice category and post the technical things exclusively on my blog.

It's great to be hands-on with SQL! I never tried ChatGPT for queries because at the complexity where my tables and apps are, I can get by as a lvl 4 SQL player, although I remember touching on Partitions for something. 😃

Anton Zaides's avatar

I actually hesitated a lot about the topic, as it’s out of the regular kind that I posted here. I finally decided to go for it as I really enjoyed writing this piece, and I think it’s useful for managers too (which are my main audience).

Sridaran Thoniyil's avatar

Thanks for the great article!

Nathan Pankowsky's avatar

Funny, because one of my first jobs was a decoder at Cellebrite, I know the binary format better than the complex sql queries

suman suhag's avatar

The original proof only requires that when there are overlapping reads and write to the same storage, only the the write must be correct, for the algorithm to function properly.

And yes, it assumes that there is cache coherence among all actors, and therefore that the writes will be idempotent across all nodes — which in fact is what guarantees coherency.

The algorithm was developed for cache coherent MESI coherence model systems, or better.

So no, it does not include clusters, or distributed systems, unless they also guarantee operational idempotence.

For databases in particular, including the subset of instances of database implementations we call”filesystems”, that the overall system provides ACI(F)D guarantees.

Databases which only supply BASE, like NoSQL implementations, need not apply.

Or in other words, don’t expect Lamport’s to work on those systems, and pick a different algorithm instead,because in the context of the proof onthose systems, the proof no longer is a proof.