How to solve every data problem in SQL - Episode 09
CLAIRE: I want to welcome everybody to Path To Citus Con, the podcast for developers who love Postgres, where we discuss the human side of open source and databases, Postgres, of course, and the many PG extensions. This podcast is now available on all of your podcast platforms. You can get to past episodes and get links to the various podcasting platforms at aka.ms/PathToCitusCon, all one word.
Transcripts are included on the episode pages on Transistor too, and I want to say thank you to the team at Microsoft for sponsoring this community conversation about Postgres. I'm Claire Giordano.
PINO: and I'm Pino de Candia, and today's topic is how to solve every data problem in SQL. I'm honored to introduce Dimitri Fontaine, a Postgres contributor who works at Microsoft and has been working in the Postgres community since 2005.
Dimitri co-founded pgDay Paris. He's the author of Postgres CREATE EXTENSION back in 9.1 and I believe CREATE EVENT TRIGGER in 9.3. He is also the author of pgloader, pg_auto_failover, and more recently, pgcopydb. And he wrote a book, The Art of PostgreSQL, that is aimed at app developers. Dimitri, welcome.
VIK: Thank you, Pino. Thank you for having me.
CLAIRE: And I would like to introduce our second guest, Vik Fearing. He is also a Postgres contributor. He works at EDB, has been working on Postgres since about 2008, I think. Co-founder of pgDay Paris. In fact, the conference sponsorships just opened and the CFP is open as well. Registration, not quite yet, but it'll come. And Vik is a co-organizer of this year's upcoming PGConf EU, which is happening later in December in Prague. Welcome Vik.
VIK: Thank you. Thank you for having me.
CLAIRE: We're thrilled to have both of you here. Okay, so let's dive in. The topic today, as Pino already said, is how to solve every data problem in SQL, or S-Q-L.
Okay, let's just get the pronunciation right. What's the pronunciation of this thing?
PINO: Well, I got to watch Vik's, video of Vik, from 2022 where he said the official pronunciation is S-Q-L. Vik, did I get that right?
VIK: Well, you called me Nick, but otherwise, yes.
PINO: Oh, Vik. Sorry.
CLAIRE: I assumed he was talking about someone else who made a video, but this is probably a common, topic of conversation: pronunciation in the Postgres world.
VIK: Yeah. Well in the entire S-Q-L world. The correct pronunciation, according to the standard is S-Q-L. Prototypes of the language were called SQL for a standard, or Structured English Query language or whatever it was called. But as for the standard, as it exists now, and as it existed in 87, and then unfortunately commonly 92, it is S-Q-L. One of the reasons is that the committee thinks that users are ill served by claims that the word "structured" accurately describes the language, which it does not. S-Q-L is not very structured, but it is a query language. So some people call it a recursive acronym, which is S-Q-L stands for "SQL is a query language".
CLAIRE: So, I'm curious to see why each of you thinks that we invited you here today to talk about solving every data problem in SQL. You were both recommended by a couple of different people to be like, oh, the right guest for this topic is Vik. The right guest for this topic is Dimitri. Why is that? Why are you the right people to talk to about this?
DIMITRI: Yeah, I can go first on this one. I think for me it's, it's kind of obvious because, I've put a lot of efforts trying to, to talk about this topic a lot, both, in my blog, I started writing articles about, how to do more things with a SQL, maybe back in, 2009, 2011, something like that and it grew up, as far as, me writing, authoring a book about it. Like, The Art of PostgreSQL is mostly about that. Like, how to benefit the most, from SQL as a programming language or, as an application development platform. So I believe I managed to get the image of someone who's talking about that topic. Yeah.
VIK: I think for me, it comes from the fact that several years ago, I was dared, and accepted the dare, to do all of the Advent of Code challenges in SQL. Advent of Code is a yearly challenge in December where there are two puzzles every day that you need to solve and you can do it in whatever programming language that you want. And I started doing it in SQL. And now there are several other people who do it in SQL. And that's great because I get to learn from them as they learn from me and we all learn together.
CLAIRE: Were you successful?
VIK: Well, I've never been successful in doing all 50 stars. Because there are two stars a day from December 1st to Christmas, December 25th.
I've never been successful in doing them all, but I have been successful in doing quite a lot of them, yes.
CLAIRE: And when you say you were never successful in doing them all, is that due to lack of trying? Like, you took a day off because you wanted to be with family, or is it because some of them were just fairly tricky to implement in SQL?
VIK: Well, to be completely honest, SQL is not the right tool for the job for most of them. SQL is a data language, or more precisely, a data sublanguage. And, it's job is retrieving data and analyzing data. It's job is not calculating various algorithms like, routing planning.
CLAIRE: Got it. Okay. So, Advent of Code is how you had your name made in this space and then, Dimitri, between your blog and your book, that's why people think of you when they think of SQL. Sounds like if I'm taking away what you both said, but I'm, I'm still curious to go back into like, how did you each get started? Like why did you start working? So Vik, you work on the SQL standard a bit.
Is that right?
VIK: That's correct, yes. I'm a member of the SQL committee.
CLAIRE: Okay. Why? When did that start? Why did that start?
VIK: Well, I've always been interested in languages, be it human languages or computer languages and, when I first read the standard, as opposed to reading Postgres documentation I found that there were a lot of beautiful things in it, that Postgres does not yet implement.
And then I'm hoping it will sometime soon. And so I just wanted to be a part of it. And so I have been a part of it since, September of last year. So a little bit over a year. And it's, it's been, very interesting to see how the sausage gets made and, how the standard becomes what it is. And just last month, I got, two features of my own, into the standard.
So, that was exciting for me.
CLAIRE: What did you do to celebrate?
VIK: I had a beer.
PINO: Can we ask what were those features, Vik?
VIK: So they're both related. One of them is to provide a default value to the cast specification so that if you cast something to a different type and it's not able to be cast to that, instead of erring and killing your query in your transaction, it provides a default value.
The other one, which is very close to it, is being able to test if something will be able to be casted to a various data type. So, for example, If you have a text value, and you know that it's supposed to be a date, but you don't know what format it's in, if it's in the ISO format, or if it's in world format, or if it's in American format, then you can say, in case when your value is castable as a date in the format of this, then do that.
When. Value is castable in the format of date, with this format, then do something else and so on and so on. So it allows testing what your data is without causing any errors.
PINO: And do you know if anyone's working on getting these implemented in Postgres?
VIK: I do. Yes. Corey Huinker is currently working on, getting those implemented in Postgres.
CLAIRE: Awesome. So, that's how you got started in the SQL standard committee. Dimitri, why did you start caring about, I mean, when I think of your work in SQL, Dimitri, I guess I have a copy of your book based on the original name. So it's not The Art of PostgreSQL that I'm holding in my hands right now. It was formerly called Mastering Postgres in Application Development and I always thought of it as really targeted to app developers and helping them learn what they needed to learn about Postgres and also SQL, but what I don't know is why you wrote the book.
DIMITRI: Ah, yeah, that's a good one. Well, basically, writing the book was kind of a brain dump for me.
I used to be a consultant and do a lot of training, but also, I had the chance to go speak at conferences that were not PostgreSQL specific, like PHP conferences or something like that, Python conferences too. And so to meet with a lot of developers who were a little confused about this SQL thing and for me, I got into SQL because of two things. The first one is I'm lazy. So if you open up any website that is trying to explain how to sort data, it's pretty complex pretty often, and you have a lot of different algorithms that are meant to be the best at the situation when you have this kind of data, or when it fits in memory, or when you, et cetera, et cetera.
And it's pretty complex. But then if you want to make sure to sort the data in a convenient way, whatever the data looks like, and whether it fits in memory or whether you need to spill to disk or etc. In SQL, it just ORDER BY, that's all you need to know about it. So if your data is already somewhere that you can use SQL, well then all you need to know about this very, very complex and fascinating topic is order by and that's it. It's done.
And, I've always been a little fascinated by, making complex things easy to use and simple to understand. And so for me, SQL was that it was filling a gap where some of the things are very complex. You need them a lot in a lot of situations all the time, basically, and you want to make them simple.
In many cases, SQL is doing exactly that and in some cases it's not that simple anymore, but, that's a good start. Yeah. And the other part of it, apart from being a little lazy, is the transactional aspect of it, where for correctness, you need to be able to push some of the processing down to the transaction, which means that you need to master more of the SQL language, just to be able to have a correct implementation of what you want to do.
VIK: That's one of the big differences between SQL and other languages is that SQL is declarative. So as Dimitri just said, all you need to do is say order by this and you don't need to say how you want it, how that should happen, the database engine itself, so in our case PostgreSQL is responsible for figuring out what the best way to do that sort is.
CLAIRE: Okay. So when, let's assume there's a college student listening to this podcast later, when you say SQL is declarative, can you just put a definition to that beyond the fact that the database engine is responsible for figuring out?
VIK: Well, in there are two main, programming paradigms today. There are many others, but two of them are popular today. One is functional, where everything is a function and you pass functions between things and other things. And the other is imperative, where you tell the computer what to do. SQL is different because it is declarative, which means you don't tell the computer what to do.
You don't tell the server how to get you the data. You just say, I want this data, and this is how I want it. And then the server figures out how to do that. Because the server itself is of course imperative. But that's not what the developer uses. The developer uses, the declarative language of SQL to just say, this is what I want.
And that's what Dimitri was saying about, not having to figure out algorithms or anything to say, this is, this is what I want. Give it to me.
DIMITRI: Yeah. Basically to fill in or to amend or add, sorry, to add to what Vik is saying, everything Vik said is exactly correct. But, another way that I like to look at it is that you need to declare the result of the operation.
Rather than how to implement it, which is a little complex to do because it makes it very hard to write SQL if you don't know exactly what you want to achieve with imperative languages as a beginner developer. You know you're going to have to loop around something, so you know how to copy paste, you know, the skeleton of a loop, or a template that looks like looping over an array, and maybe computing the min, max, or the sum, or something like that.
And then you can adjust and arrange until you have what you wanted. There is no way to do that in SQL. In SQL you need to think very hard about what exactly do you want to do and then when you figure it out, which my advice for that is, can you write what you want to do in a single sentence?
So if you're, speaking in English, usually that would be an English sentence. Just do it in your own language, but, if that's not English, I mean, but just try and write what you want to achieve in a single sentence. And when you have done that, then writing SQL becomes way more easy. It's not always that easy, but, at least you get, you know where to get started and what to do. And, and what you need to do is declare the results that you want to obtain. And you never, or almost never have to think about how are you going to get from the data on disk to the, the result of the query. That's, that's not your job.
VIK: Yeah, and just a comment that Dimitri actually, invalidated by the end of what he just said.
It's not telling it what you want to do. It's telling what, telling it what you want to have.
DIMITRI: Yes, exactly. Yes.
CLAIRE: You know... This, this analogy, I don't know how many of you this will resonate with, but, during my years as a PM, that was actually one of the trickiest things to learn, especially coming out of engineering, I had to learn how to write requirements where I was focused on what was the end result that I wanted, and I had to completely ignore implementation details.
I had to ignore the how and the whole process of doing that because it's super easy to spec the implementation or the how that you think you might want. Or the way something should be solved, but I had to keep peeling the onion, peeling the onion to figure out, no, no, no. What am I trying to get to just give the, you know, what you want to have, what you want to result with and let engineering figure out the implementation.
I don't know. Does that analogy resonate with anyone? Am I the only former PM here?
VIK: No, that's, that's exactly right. Telling the database what to do is micromanaging. So as an SQL developer, what you need to do is just tell the database what you want and let it figure it out.
CLAIRE: Very cool.
PINO: And it sounds like we we're talking about that paradigm shift when you're coming from a language like C Java or Python, ruby whatever to, SQL and then learning the language is quite extensive and learning how to do what you want in the language. Then there's another aspect which is deciding what functionality to move from your app or, yeah, from your application code into, your database. Is that, is that an art or a science?
VIK: It's both
CLAIRE: Said in unison, it's both
VIK: Dimitri better prepared to talk about this than I am, but I just want to, before he answers this, I want to specify that SQL is a sub language. It's not a language. You're still doing all your application code in whatever language you're doing it. You're getting data retrieval from SQL
DIMITRI: Yeah, it's fair to remind that Prolog for example, the Logical programming language.
You can do everything in Prolog even the UI Prolog has been extended from just having logic infrastructure logic, things to do with, doing all the things that you need as a programmer. SQL has not been extended that way at all. It just, it just focus on one area of your development.
And yes, SQL is pretty good at it. And, well, I can't remember the question now.
PINO: My question was about deciding what, functionality to move into, into the database. I mean, you gave the example of a sort, right? Or a filter where you don't want to pull the data to your app, right?
That kind of, is intuitive. But are there other trickier parts? Yeah.
DIMITRI: So art or science. So the science part is, you can think of it with two main aspects to it, in terms of, correctness and efficiency. So you want your code to be correct, which means that you need to think about transaction boundaries.
And, what often happens, which also is a paradigm shift is that when you write the code, usually it's easy to think about one user at a time. So you're trying to make it, work for the workflow of one user, you know, checking in something, adding to the basket and then, check out, pay for it, et cetera, et cetera.
And you're focused when doing the development on what happens to that user. But, when you reach the part where you put an item in the basket and pay for it, maybe you want to make sure that that item is no longer available for anyone else to purchase. And then you need to think about the overall system, the system as a whole.
Not just that user, but every other user that is currently using the system. And, that's where there is a boundary between usually what you can think of as being relevant to the application side of things and then what's relevant to the database system, which needs to be correct for every user at every time.
And so that's the framework I like to use to try and think about that piece of code that I'm writing. Does it need to be correct for only one user at a time or the whole, you know, database, the whole system? At once, in which case I'm going to rely on, on SQL to be able to implement it. So that's the correctness aspect of it.
And the, the efficiency aspect of it is, like you said, and you, you began talking about that. There is, basically, processing and data. Sometimes you want to push the processing close to the data. And sometimes you want to move the data close to the processing. For example, for instance, if you do a count of all the data in your database, well, the result is going to be a single number.
Whatever the number of rows or items that you want to count, the result is one single number. So ideally, in terms of efficiency, what you want to retrieve is a single number, and you're done with it. And so you're going to push all the count complexities within the SQL query to be able to achieve that level of efficiency.
VIK: Sometimes you don't, sometimes we'll use, and this is a wrong thing to do. This is an anti pattern. Sometimes people count everything and then just. Look to see if the count is more than one. And that is absolutely the wrong thing to do. SQL has a clause called, exists where you can just say, is there one? And then the server will stop counting as soon as it finds one.
CLAIRE: And it would be the wrong thing to do to count because you would waste all sorts of cycles counting more than you needed to, you could have stopped at one. Is that what you're saying?
VIK: Yeah. You want to know if there is one? And by, and to do that, you count a million of them and that's a waste of processing.
So it's kind of not really related to what we're talking about though.
CLAIRE: That's okay. Well, but as I was preparing for today's conversation, I was chatting with Daniel Gustafsson, who is one of the people who recommended both of you for this topic, by the way. And, He asked me to specifically ask you about these, these anti patterns, if you will.
So the example of an, an anti pattern he gave me was the practice of performing joins in the application code that people do that because the database doesn't scale, right?
VIK: Yeah, because they think it doesn't scale. Yes.
CLAIRE: Yes. Because they think it doesn't scale.
DIMITRI: I have another theory about that, but it makes me very unpopular each time I say it.
So first, wave to Daniel. Cheers, mate. And then about that, I think most people, they're on SQL on the go, if you will, that a very few number of people that I've met with, I've actually spent time to just learn SQL properly and then understand it. So what are the essential concepts of SQL?
What is it that it exposes to you? What are the, if you do Java, it's object oriented, you need to understand. Object modeling, for Python, you need to understand that and also imperative programming. If you do Lisp, you need to understand functional programming. Same with the ASCAL or some other F sharp, Scala, some other languages, platforms, et cetera.
But for SQL, most people, they don't stop and think about it.
CLAIRE: Well, it's a tool and they just learn the tool to do the thing that they want to do right now. And then they got to move on to something else, right?
DIMITRI: They copy paste, and they use it as a tool, but there is more to it. So the main concept that you need to understand for SQL, the same as objects in object oriented programming, in SQL it's a relation.
What is a relation? And if you want to understand that, quickly enough, you need to understand what is a join, because it's the main thing that you can do with a relation. It's a join. And then you have anti joins and semi joins and other ways to model and think about it. And then there are ways to implement a join, but it's not a concern to understand the concepts.
It's a concern to implement, to understand the implementation and the execution of a query. But for the concepts, you just need to think about join, anti join and semi join. And, if you don't understand what is a relation in SQL and then what is a join, you're going to make your life miserable. It's going to be very hard to understand SQL.
And then I think in many cases, developers end up implementing a join in their In the other side of the programming language, just because they didn't think about it, and they don't have the tools to think or the concepts to think about it, so they did not even realize they would do that. I think that happens in many, many, many cases.
PINO: A quick, a quick detour, and then we can come back to the anti patterns. But I want to ask you, Dimitri, about, I heard you in a talk, not being very happy about ORMs, Object Relational Models. And I wanted to ask, though, that, you know, can the language compiler, help us here? So, for example, the object relational model, gives us patterns that will, result in SQL being built for us if we don't know SQL well.
So, how do you feel about that?
DIMITRI: Yeah, very strongly.
VIK: Me too.
DIMITRI: So, ORMs should be the most simple thing in the world if it was done that way. Look, in SQL what you do is manipulate relations and every SQL query is going to define a new relation, right? A relation it comes from the mathematical area of science and things.
So, what is a relation? It's a collection of objects that all have the same list of attributes, they call it attribute domains in the slang, but basically an object is a number of columns, and each column always in the same order, and each column has always the same data type. Basically, that's all you need to know.
And then you have a collection of them, that's a relation. And then there is the thing, is it a set or a bag? So SQL said it's a bag and then, you know, okay. But, it's not more complex than that. It's a collection of objects and all the objects are structured in exactly the same way. That's all you have.
That's a relation. Every SQL query is going to give you a relation with a zero, one or more entries in there. And every programming language in the world is going to be able and have some libraries and, internal stuff, built ins, helpers that allows you to manage a collection in memory, right? So the ORM means object to relation mapping.
And what you want to map is the result of a SQL query, and that's a collection of objects. And so you want to map that result of a query into a collection of objects in memory that you can use in your programming language. That should be the easiest job on the planet. The trouble is that, in SQL, there is this statement, named table.
So you can output the contents of table. And, because it's a SQL query, well, then the result is a relation, right? But it's not the most interesting one. It's just, you grab whatever is in the table, but basically the overhands, they understand that that is a relation and they stop there. And so they want to handle tables rather than think about it as a relation.
A table is just a relation, like any result of a query, like any result of a join. Like we said before, a join is a tool that you use to build a new relation from two existing relations, and the result is a new relation. That's all it is. So the R of my personal take on that is that, most people, they didn't stop and think about it and, they didn't really understand the R of ORM, what is a relation and, what are we going to do with that?
And ORMs, they usually try to solve three different problems that are very interesting and that maybe you want to solve, but, because they want to solve the three of them at the same time, I believe that a lot of ORMs that I've seen are not doing a very good job at it, let's say it that way, because each time I'm told that I'm not very good at ORMs, which is true, and that some of them are very good, which might be true, but because of the first part, I don't know, so maybe some of them are very good out there, so just list them if that, if that's the case, but yeah, that's my experience with ORMs.
VIK: Yeah, I would say that in addition to that, because I fully agree with Dimitri that there are two that I know of that are kind of reverse ORMs, which do take into account the fact that the relation is the result of the query and not just some table. Most ORMs just map tables to objects and that's it.
The two that I know of are POMM, P-O-M-M, which is for PHP. And JOOQ, J-O-O-Q, for Java. And those two I would recommend to people.
PINO: Okay, thank you. Thanks for the detail. I'll check those out.
CLAIRE: All right, so now I want to go back to that anti pattern question from Daniel. So, he was just curious if there are classes of patterns, so let's flip it from an anti pattern question to a pattern question.
Are there classes of patterns that are typically done in the application, but which can, or maybe should, be pushed down into the database?
VIK: One example that I've had experience with is finding, equals. So for example, in the case of, Olympic medals, if you usually get a gold, a silver and a bronze, but if the top two are exactly equal, then you get a gold, a gold and a bronze, you don't get a silver and programming that I have noticed is not the easiest thing to do.
Whereas in SQL, it's just a rank. And rank will give you, exactly what you need. It'll go one, one, three. And if for whatever reason you want a gold, a gold and a silver and no bronze, then you can use dense rank and it'll give you a one, one, two. That's something that I've seen people try and implement over and over again.
Whereas it's just one simple function in SQL.
CLAIRE: What do you do when there's a tie for, for bronze? And the rank would just be one, two, three, three.
And I believe that's how the Olympics do it also, but even if not, the example I think is still valid, that if you have five people who are all tied for bronze, then they should all get bronze medals.
PINO: Vik, what's the underlying principle here or why is there this goodness in SQL? Is it because you've got community working together on definitions of the function, not necessarily the implementation, because you can imagine that any language could give you libraries to do that have the same kind of variety of. In power, of definition of functions that you might need in these cases.
VIK: Sure. But suppose that you only want, the top three. So you only want gold, silver, and bronze, even if there are five bronze winners with SQL, you can just say, okay, give me all this. And then you can save fetch first, three rows with ties.
And then you might have 10 rows in the output, even though you only asked for three, but that's because there are, are ties. And so you, you want to get that data. If you do that in the application, you have to get all the rows or at least keep asking for rows until you can find that, okay, here's a fourth place.
Let's not include that. And let's stop asking. But that's something that you have to code. Whereas in SQL, you just say fetch first three rows of ties.
DIMITRI: Yeah. If you're very unlucky, you have 1 million rows in the database and everyone has been having exactly the same results. And so your first one is one million guys.
There is no way in the application code knows how many rows to retrieve.
CLAIRE: Okay, so that's a good pattern where it makes sense to query the database or do that calculation in the database, if you will. Calculation may be the wrong word. Are there other patterns? That are worth sharing or talking about. And by the way, I love the fact that you put it in the, the construct of Olympic medals because it just makes it so easy to visualize that scenario, Vik.
PINO: Can I maybe ask about a specific scenario since, I know this is relevant to your question, Claire, in terms of patterns, I saw, Dimitri, in a few different talks, educating, the audience about window functions. And I wanted to ask why is it, you know, did you see that as a gap that folks weren't using it?
And it's very powerful.
DIMITRI: Yeah, exactly. The thing is, I think Vik alluded to it, earlier, with the SQL standard, it started, I think the first one was 83, then 89, and then 92 has been the most famous one. But it didn't stop there. The current SQL standard is very recent, 2023. Wow. Okay. That you cannot do more recent than that.
VIK: Yeah. June 1st.
DIMITRI: Oh, thank you, Vik. Okay. So I need to read some more now about that stuff. But, if you don't pay attention very quickly, you're, you're missing out. But most people, what we've been taught in the, in university, so I'm a little old now. So maybe my example, my personal experience is not that relevant anymore, but when talking to people, it seems to be still to this day.
That most people are being told SQL 92 and sometimes SQL 89, which is more than 30 years ago. And at the time, I don't think a window functions were included. Can't remember when they appeared.
Say again, sorry.
DIMITRI: 99. Yeah. So if you, if you've been told 82, the version of the SQL which is. Depricated, by the way. Each time there is a new one, all the previous ones are depricated. So you shouldn't learn them. But, just learn the current one. But most people are taught 92, and so they don't know about window functions, and they don't know about lateral queries and things like that.
And window functions are pretty magical because they allow you, from within a SQL statement, to have a look at the other rows they called peers. But you can have a look at the rest of the results from within the query itself. And that's a deal breaker for writing SQL queries. And in many, many cases, people are, , sending a query to the database, getting back the result, and then processing the result to get the final, results they wanted.
And all of that because they didn't know about window functions.
CLAIRE: But what do you want people to remember about window functions, Dim? Like, you get a short elevator ride with someone for just a couple of floors, and that's it, to teach them. What do you want them to remember?
DIMITRI: Well, did you know that in SQL you can write a query where you both retrieve any number of rows and also the total number of rows and even per group.
So you have the individual rows and also the counts per groups. For example, a top three, like, or the number of things per week or per day. You have statistics per day and you want all the days, but you also want the sum per week. You can do that in a single query.
So you can compare this Monday with last Monday and, how much of this Monday is responsible for the weekly results and things like that. That's a single SQL query.
VIK: And this comes back to telling the database what you want and not how to do it.
I saw in the chat that, Robert Treat said that the way he considers it is a declarative language is describing the problem, and an imperative language is describing the solution, and that's a very good way to look at it.
DIMITRI: You stay in the problem space. If you're not used to it, it's hard to do, but,
CLAIRE: but once you get used to it, it probably is a superpower. Kind of what I hear you both saying.
DIMITRI: Well, it makes SQL easy, which is, which then it's part of your toolbox and you can just use it when you need to.
CLAIRE: Okay. All right. So back to my pattern question. So, so we've got the Olympic medal example of how that's done effectively efficiently in the database.
And now you've talked to us about window functions as another pattern that we want people to know about that where it makes sense to query the database for that. Are there other similar patterns?
VIK: Well, there's one that I am very fond of, but, unfortunately it is not yet implemented, implemented in PostgreSQL, which is row pattern recognition.
And that is something that is it in, in the standard. And that is also another game changer, just like window functions were, where, you can basically run kind of the equivalent of regular expressions on the rows that you get back. And so you can say, find me a pattern in all this data where, so to take the example of, that the standard itself gives.
If you've got stock values, you say, find me a pattern where it starts at some, at some point, don't care what it was. And then it goes down to another point and then it goes up and then it goes down again and then it goes up, but higher than the original point. That's something that is extremely difficult to do in, in anything really.
Yeah. In your application code, but it is a very simple thing to declare in SQL code. And, that is part of the SQL standard, but not yet part of PostgreSQL.
CLAIRE: And is anyone working on that?
VIK: Not that I know of. Well, no, that's not true. Tatsuo Ishii is working on it, in a very specific case, but once, once he gets it done, then it can be, generalized to the, well, to the general case.
So yes, some, Tatsuo Ishii, and I forget his name, I'm sorry, who is reviewing it, and also wrote a little bit, of it. Yes, people are working on it.
CLAIRE: Okay, awesome.
PINO: Does the standard, does the size of the standard ever become a problem? Like, you just keep adding functionality at some point. Is that problematic?
Do you need to deprecate some other functionality? Or how do you manage that complexity?
VIK: So this is the question about the standard itself, right?
PINO: Yes, but I guess it applies to the implementations as well, because they have to become that much bigger, and maybe there are interactions that make the implementation, multiplicative in size.
DIMITRI: Yeah, that's what PostgreSQL is very good at, if I may. The pick of what you want to implement from the standard is usually done because someone needs to actually use it, and someone is motivated enough to implement it. And so the parts of the standard that are not very useful to anyone. They tend to not be implemented at all.
That's the way we reduce the complexity, I believe.
CLAIRE: I mean, when I think of Postgres, I think of, in some sense, bottoms up decision making. Right? There's, there's no central team that's deciding on the roadmap for what's going to get done or not get done in the next releases of Postgres, right? So what you just said makes a lot of sense, Dimitri, it's all about, you know, who is working on Postgres and has an itch that needs to get scratched.
And if something new in the standard will serve, you know, the use cases they care about, then it'll get done. Did I just paraphrase what you said, right?
DIMITRI: Yeah, I think you did. The trouble is that some of the areas of the standard that are very interesting, like the one that Vik mentions, sometimes they're very hard to implement.
And you need to wait until someone is motivated enough to swallow the big complexity of the development before you get that in Postgres. So the fact that it's not in Postgres alone doesn't mean that it's not interesting. It could also mean that it's very interesting, but very hard and requires more investment that people are ready to put into it at the moment.
VIK: And sometimes the standard defines things in order to define other things. For example, assertions are basically check constraints, but for the entire database and there is not a single implementation that implements them because they're extremely hard to do. And because, there are security reasons not to do it, but on the standard side, saying that the database must be in this state is a very easy way to, to specify something.
So some things in the standard are specified. Only to specify other things and nobody has actually implemented them.
CLAIRE: Okay. So for someone who wants to learn SQL, do either of you have like favorite resources for recommending, and I'm thinking about application developers when I ask this question, how should they skill up in SQL?
And let's focus on the Postgres side of things, right? Not talk about other databases per se.
VIK: Yeah, there's something I, unfortunately, I don't know the name offhand because I don't need it, but it's, there's something called PG exercises or Postgres exercises or something like that. Which gives, well, exercises in learning SQL syntax, Dimitri, you might know what the name is.
DIMITRI: Yeah, I think you're right with the PG exercises.
CLAIRE: Yeah, pgexercises. com, I think is what it is.
VIK: Something like that, yeah. And, and there are also games. There's a murder mystery that you have to solve by running SQL queries.
CLAIRE: On that particular website?
VIK: No, it's something else. Let me look it up while Dimitri answers this question.
PINO: Love that idea of solving a murder mystery while programming. Yeah. Plus one to that.
DIMITRI: I think it comes from the Prolog ecosystem, or in Prolog, it's the kind of thing you do as a one on one. Anyways, so for me, as a resource, I would recommend my book because that's the reason why I wrote it in the first place.
CLAIRE: Okay, and I just want to give that a bit of a plus one too, like, I can't tell you how many people have recommended your book to me.
So I think when you're recommending it, you're doing it with like, not selfishly, right? Because it, it really is useful and it has a new name now, right? It's Art of PostgreSQL.
VIK: The Art of PostgreSQL.
CLAIRE: yes. The Art of PostgreSQL. Okay.
DIMITRI: Yeah, that's the name. So, but the other thing I like to tell people is that, most application developers, they work in the context of an application, obviously, and that application has data somewhere.
So my advice for them is, like if you want to grow your skill set in SQL, one idea that I like is that, each and every day grab 15 minutes, maybe half an hour. Think of, you know the curiosity you may have about the data set that you have in production or many customers are using that feature if you register that or, what is the distribution of our customers given this criteria and, what's the evolution of this number that we have for customers monthly, you know, in a monthly basis, etc, some of those queries are marketing queries, but basically just diving in the data set, just think of a way to find interesting bits and pieces in your data set and try and solve that using SQL and only SQL. So what you do is you open typically a psql scripts, but if you're, if you prefer pgAdmin or something else, or a web app, whatever you do, just a simple tool that allows you to write a query and see the result and try to answer one interesting question a day using only SQL.
And usually if the question is interesting, the bits and pieces that you will need from SQL are going to be interesting too and advanced enough that you will learn new things every day. Because in my experience, whatever the domain, If you want to learn new skills, you need to do it every day for a little, that's how you grow up, that there is no, that's why my book is called, The Art of PostgreSQL or before that it was Mastering PostgreSQL because that's the idea behind artisan in French, which would be craftsmanship and or mastering, being a master at anything you need to do it every day.
CLAIRE: So it's kind of like exercise.
DIMITRI: Yeah, that's the only way that you're going to be good at something. Do it every day.
CLAIRE: Build the muscles, build the habit. Is habit an important part of it too? Or is it really just about building those skills and familiarity?
DIMITRI: I think you need to get to a point where when you have a new problem that comes in your way, you immediately know if it's meaningful to do it in SQL or not. If you're still asking yourself the question, I could do it in SQL, but I don't really know how, and it looks painful to do, and I'm going to spend one full day, and I would rather write code in my usual language.
Well, then you're not there yet with a SQL. Then it's not a choice It's just a burden I think you need to get to the point where it's it's an actual choice meaning that you know exactly how to do it in SQL and then you choose to do it in SQL or in something else but until you have that skill level that allows you to make it an explicit choice in your mind, like I could do it that way or this way and my preference today is this way. Well, then that's cool. If you don't know how to do it in SQL, well, then it's not a choice anymore, is it?
VIK: I agree. It's not like riding a bike where once you learn, you know it for life. It's like playing piano where once you know a song, then you learn a new song.
CLAIRE: Aaron just wrote in the chat that, he's comparing this discussion to learning the guitar and that doing something on the guitar each and every day kind of helped him get further along in that journey.
VIK: Yeah. So I said piano, but guitar is the same thing. It's any instrument or any skill, except for something simple like riding a bike.
Unless you want to do like trick riding, once you learn to ride a bike, once you learn SQL 92, you're not set for life, even though you can always fall back on it. It's like, oh yeah, I know how to do that. If you don't know SQL 2023, or if you don't know other ways of doing things, then like Dimitri said, you don't have a choice.
You have to do it in the application when that's possibly not the best choice to make.
PINO: Hmm. And are we talking about does your recommendation include, or does this recommendation include learning about extensions and the functionality they bring, or are you really just talking about SQL implemented in Postgres?
VIK: Well, it depends on the extension. Because there's like the Postgres extension, which is amazing. And that can do all the routing and all the geometrical and all that stuff. But mostly the language itself is what I'm talking about. Dimitri might have other opinions.
DIMITRI: Yeah, it depends on what extension you're mentioning.
Like, like Vik said, because, a lot of the extensions are just, extensions to do, to SQL itself, meaning that you just keep writing SQL, but the extension is going to implement like an operator on the data type for you, to do geolocation based thing or, look up an IP address in an IP range, things like that, that are not very easy to do with just, let's say, standard SQL, but, you just use the SQL as you are used to with a new, data types and operators, which makes SQL looks a lot like an object oriented programming environment.
And I love that. But for me, if you do SQL each and every day, it would be obvious to you at some point that well, you could almost write it in SQL. If only you had support for a this behavior that is specific to a data type that you don't have in your library in Postgres yet, and then maybe you just do a quick, search and find an extension that implements exactly that and then you up along and do it.
So, for this kind of an extension. I believe it comes naturally once you're skilled enough in SQL, you realize immediately that it's just a SQL query, but with the plus sign means something different or the contain operator for an array, for example, you want that for a range of IP addresses. So is there a way to store an IP address in Postgres and then you find the extension
VIK: Or take a modern example, the pgvector stuff for AI?
DIMITRI: Oh, yes. Yes.
CLAIRE: So Vik, before we leave this question about how do you recommend developers skill up in SQL, what your favorite resources are, you mentioned pgexercises dot com and then you mentioned that SQL Murder Mystery, which you had to then go dig up for us. Are there, and then we left you and we went to Dimitri.
So I wanted to circle back to you. Do you have other favorite resources that you recommend to, app developers who run on Postgres to learn more about SQL?
VIK: Yeah, I think that attempting to do, Advent of Code every December in pure SQL will blow up your experience in a positive way. And really, I mean, it did that for me and I think it'll do that for, for anyone.
CLAIRE: But were you a bit of an expert already? Like, is that really something a beginner could do?
VIK: Why not? And if they can't do it, then look at my Git repository or look at other people who do it and say, oh, how did they solve this problem with that? And keeping in mind that SQL is not always the best tool for the job.
And so sometimes you shouldn't be doing this in SQL, but I mean, why not? It's an exercise. It's something that you can try and do.
CLAIRE: Your Git repository, meaning the one with the name that I can never pronounce, is it like Chocolatel? Is that how I think of it?
VIK: Yes. Okay. Got it. It means chocolate in Aztec.
CLAIRE: Oh, well now I know the source. Okay. Vik's GitHub repo: I'm dropping it into the chat right now. Okay, awesome.
DIMITRI: I wanted to react on what Vik was saying for, for Advent of Code in particular, it's all about training your muscles to do things that maybe you will never do in the, in the real life, but it's back to being able to understand how far you can go with a SQL, so that later on you can have a choice about doing it that way or another way. But with Advent of Code, usually it's pretty obvious that, it's not the best choice, but it's still interesting to solve it that way because you learn a lot.
VIK: And sometimes it adds things to Postgres, like for example, one year I needed, the greatest common denominator and I had to do that in SQL and I thought: "this is not good."
And so now Postgres has a GCD function, because of that.
CLAIRE: So when you're doing something like Advent of Code, or maybe you just want to give yourself some exercises, there's a question in the chat about whether there are any open data sets that either of you particularly enjoy using?
VIK: Oh plenty. Dimitri is the right person to answer this one.
CLAIRE: Okay. Talk to us, Dimitri.
DIMITRI: Yeah. I had to do that for the book, find data sets and, there are so many out there. So you can find all the NBA games for instance, or, all the Formula One championships and results, with all the accidents people had driving the cars and the frequency of the accidents, etc.
So you can dive deep on that. You can find music databases that people have shared. One thing I did that I found pretty funny, all the text from Shakespeare's books or, the theater plays, are also commonly available. So you can grab that and put that in the database and play with that if you want to.
So there are a lot around there.
CLAIRE: The music databases, does that include, like, lyrics?
DIMITRI: I'm sure you can find some databases with the lyrics too.
DIMITRI: I'm sure you can. So, I've been using Last.fm. They did provide a couple of, exports from their databases back in the days. Others are available too.
VIK: And there's also OpenStreetMap.
DIMITRI: Yeah, OpenStreetMap is a big one for our GIS stuff, but that's also why.
CLAIRE: Yeah now, OpenStreetMap runs on Postgres, is that correct?
DIMITRI: And that's also why I mentioned every application developer, usually because they have an application to take care of on their, on the daily job, they also have access to the database with a lot of interesting information already.
So if you want to do that, in your overtime, and you want an open source database that's very nice to have, but maybe you don't even need that. Maybe you just already have one very interesting databases that you never bother to actually invest into what's in there.
PINO: So since it's 2023, and it's the year of chat, of generative AI, I have to ask, you know, do you think it's a good habit to maybe, when you're doing your engineering, you're working on your project to always ask... you know, if you don't know it off the top of your head, ask a Copilot.
Can this be solved in SQL or how would I solve this in SQL?
VIK: Yeah, just go ahead and ask, but don't trust it.
PINO: Don't trust it. Sure. Sure.
DIMITRI: Yeah. The thing with the tool is that it's an awesome tool if you are able to understand the answer. So, because otherwise it's just a wasting time really because 80 percent or maybe 90 percent of the answer is going to be good and you, you will have to trash and rewrite 10 to 20 percent of it.
In my experience, the problem is, if you don't know what are those 10 to 20%, you're going to burn a lot of time to try to understand what it doesn't work, but I don't think you can just use copilot to do it, but copilot is going to be very good to get you to the 80%. It's going to be great if you know what those 80 percent are.
CLAIRE: Well, and that's kind of what I'm hearing from other developers, the ones who have spent a lot of time to understand things like Copilot and what works well and what doesn't. They speak very positively about it, but maybe it's because they've been building up the skills. To, like, optimize and make finding that 20 percent more efficient. I don't know.
VIK: Oh, it's not maybe.
CLAIRE: Oh, you'd say it's definitely?
DIMITRI: Imagine that you want to write an email you want to be very sharp on the language and the implications of it and you want something very good and you're not in the mood to do it yourself and you ask Copilot to do it.
Would you just send the email template that Copilot is going to give you without proofreading it?
CLAIRE: Gosh I hope not.
DIMITRI: Yes, so that's the image I have in mind. In that case, Copilot is going to write like 80 percent of the email exactly right with the, if you ask it for the right tone and the writing and etc etc, it's going to be amazing at making a template that you can use which is very good in a very small amount of time.
But you still have to do this, the last bit of it and the last 10 to 20 percent. You want to control that, and you need to know what they are, those 20 percent that you need to change.
CLAIRE: We had a, not a dry run because these, these conversations are not scripted. We haven't practiced them. We haven't even discussed the questions out loud, right?
Beforehand, but we did, we do our sound checks and everything in advance. And, when we did that, I guess it was like a week and a half ago or two weeks ago. We had a little bit of confusion about the time zones because if I'm correct here in the United States where I am, we have not yet switched time zones, but in France where you both are, you have done the fallback thing, for the year.
And so we were like confused about, well, wait a minute, it's going to be 10 a.m. Pacific. What time is that going to be in France? And I just got a kick out of how you answered that question, Vik.
VIK: Well, I asked Postgres. I just said, select date and time, America, Los Angeles and since my server is configured for France, it just gave me the time in France.
But I could have done, select date and time, America, Los Angeles at time zone, Europe, Paris, and it would have given me the same thing.
PINO: Or a date in the future.
VIK: Oh, yeah, any day at all. Right. Well, I mean the future is uncertain because who knows, what laws
CLAIRE: Yeah, yes and laws may change differently in different countries
VIK: Yeah, but for the past we have an extensive database within a database, for what the times were and what the time zones were.
DIMITRI: Yeah. We need to understand that time zones are not a scientific or a factual thing. It's a political thing, right? It's all about politics. So the complexity of time zones is a human complexity. Like, how many time zones do you think we have in the, in the planet? The answer is not 24 because yeah, no, it's not done before it's, I think it's 27 or 28 something like that.
Some have some islands in the Pacific. If you only have 24 time zones, they're used to do commerce together, but no, there are one day apart. So when they close the shops on Sunday, Sunday is not the same day because they, you know, they are on minus 12 and plus 12, so they don't share the same Sundays.
So they said, can we have minus 13, please?
VIK: If you look at a map of the international deadline, it is not a straight line.
DIMITRI: No, it's not.
PINO: You think it'll be a lot of work if some countries abolish daylight savings time, will it be a lot of work in Postgres to fix that?
CLAIRE: It'll just work, right?
VIK: Well, it'll just work after 98. I said that there's a database within the database. So once the timezone database is updated, then Postgres will just work with no changes.
CLAIRE: Very cool. Yeah. I saw a talk at Swiss PGDay a couple of years ago. I should drop the link in the chat and it really opened my eyes to all of the timezone complexity, which is not database complexity, right?
It's about like Dimitri was saying, politics and humans and governments and, you know, changing policies, oh you know, undoing of previous laws and flipping back and forth. And it was a mess.
VIK: One thing that I used to like to say in training, which I did in France was ask the students how many time zones there are in France.
And their answer almost unanimously came back as one. The actual answer is 12. Because France has, things in South America and in Oceania and all that. And so they say, oh yeah, well, you shouldn't be talking about that. I said, okay. So in metropolitan France, how many time zones are there? And they all said one.
Well, that's not true. There are two.
DIMITRI: One at a time, but two of them.
VIK: Yeah. For example, just three days ago, we changed time zones.
CLAIRE: And I say C E S T, C E T. Okay.
VIK: Got it. Those are two time zones that France uses. So how many are there in the States? Well, if you just take continental U. S. Most people would say there are four, but there are actually eight. Well, actually there are more than that because Arizona doesn't change and things like that.
But, there are at least eight in the States. And then don't forget Hawaii, but
DIMITRI: Yeah. And, I want to react to something that I'm reading on the chat, which is sarcasm, but, I think, talking about it is a good thing. Just use timestamp without time zone and pretend they don't exist.
So. If I remember correctly, I've stumbled against only one use case where you want to use a timestamp without a time zone. And that's when you manage a calendar. And if you take a, if you have a meeting at, 10 AM after the time zone change for DST, You still want to meet at 10 AM, right? You don't want the meeting to change to 9 AM because you changed the time zone politics. So that's the one case where you don't want a time zone in your, timestamp. Any other case I've seen you need the timestamp to be computed.
VIK: There's one other case that I came across, which is if you're getting data from a different source, that only gives you timestamp without time zone, but also changes it, like for example, a bus schedule, coming from Google, I think was the original case, where it changed its time, but didn't give a time zone.
And so if you need to store that, then you have to store it without time zone. But the cases are few and far between.
DIMITRI: Exactly. So as an application developer, if you're listening to that podcast and you're wondering, just use timestamp with the time zone, please.
PINO: And I'd like to drop in some, a little bit of, trivia to say that I learned from my eighth grader this year that, time zones were standardized, because when the railways were developed, because people had to read timetables.
DIMITRI: Yeah. And not only that, but, in France, typically, Paris to Brest is going to be 45 minutes apart in terms of the, the sun position in the, in the sky. And so when, before the train, you would have a local time and, even in, in a small country like, France mainland, you didn't have one time zone, every city at their own time.
So that noon was when the sun was in the middle of the sky. That's it. But as soon as you, you need to hop on the train on time, you need to define what time is it that you need to hop on the train. And so, yes, then you need all the cities. to have the same time. And then it's not practical to do that.
And so you end up with time zones.
VIK: Yeah. And this is the reason why, when you're in Europe and you're selecting a time zone, you have to choose between, Europe, Paris, Europe, Brussels, Europe, Amsterdam, Europe Madrid, because those used to be different offsets. Today they're not, but they used to be.
CLAIRE: And who knows, maybe they could be in the future.
VIK: Hopefully not, but they could be. Well, I mean, maybe Spain should go over to minus one, but, yeah.
CLAIRE: Okay. So, Dimitri?
DIMITRI: Yeah, no, I was going to say, some people might be smarter than the other ones and, stop using, DST changes.
CLAIRE: Yeah. Well, I thought that was going to happen for where I live this year, but it didn't.
So I'm not going to hold my breath.
VIK: Well move to Arizona.
DIMITRI: Always a simple solution.
CLAIRE: Okay. So before we wrap, I have to ask all three of you, Pino and Vik and Dimitri for this topic, how to solve every data problem in SQL. Is there anything else we didn't cover today that you wish we had? Cause we're running out of time here.
That maybe would make good fodder for a future conversation.
VIK: I don't think so, but I will say again what I've said at least twice now, you shouldn't solve every problem with SQL. SQL is a very good tool for what it does, but it's not, it's not the silver bullet. Some problems still need to be solved in other languages.
DIMITRI: Yeah. Which is a perfect intro for what I wanted to say. Thank you, Vik.
The only way to make a conscious choice of using SQL or something else to solve a problem is when you know how to solve it in SQL. The fact that you don't know how to solve it in SQL makes it a non choice. So if you want to have the choice, you need to learn enough of SQL to figure out how to solve it, or at least if it's possible to solve it.
DIMITRI: And, and the topic we didn't address today that I think, is very interesting and closely related to, everything we said is with a normalization. Some people, most of the people I've been talking to in trainings and etc, they would tell, yeah, I know my model is a little normalized and, usually it just happens to be that way. Nobody take the time to normalize the schema and then figure out that for that very case, it would be best if it was a little denormalized. Usually they just didn't bother with normalization at all. But with Postgres in particular, normalizing a schema is usually a huge benefit.
And again, not always, but if you know how to do it, well, then you can choose not to do it.
CLAIRE: Okay. So future topic possibility: "Normalizing schemas, what your mother didn't tell you."
VIK: Yes. And I would be happy to join you on that.
CLAIRE: Awesome. Okay. Well,
VIK: One more thing, if we have time is that, Dimitri said, and many people say, that SQL is a relational language.
It is not. [SQL] is both more than a relational language, and also less than a relational language. The philosophy of the SQL committee is, if you have a data model that you need to query, SQL can do it. And if it can't, then we'll add it. And that is evidenced by all of the JSON stuff that we've added. And also recently PGQ, which is, Property Graph Queries. And so within SQL, you can, you can do graph queries now, in SQL itself, not in Postgres yet. So it's not correct to think about SQL as a relational language. It had its foundation in that, but that's not what it is today.
DIMITRI: Yeah. I second that.
I want to clarify what I meant is that you need to understand the relation.
VIK: Of course. Yes.
DIMITRI: To really benefit from Postgres. But, that's not all you have in Postgres and set and bags and things like that. We alluded to that quickly, but yes, it's not fully relational and it's not only relational.
Yeah. Thank you, Vik. That's, that's a very good point.
PINO: I guess I, since I was included in the question, I'll say, and this is related to what Vik just said, or I'm asking if it's related to what Vik just said. I found myself wondering if PostgreSQL, and the SQL standard, how much they've been influenced by data processing pipelines, and you know, just as in the example where you add graph functionality. Just as an example where you get graph capabilities in PostgreSQL, you know, is that something that's been happening because of other data processing engines, streaming or otherwise.
VIK: Yeah. Well, streaming is a whole different topic, but, yes, things like Neo4j, and, and their Cypher language and everything are bringing graph queries to SQL.
CLAIRE: All right. I want to say thank you to all three of you and to everybody who's listening to this podcast, either now live on discord or in the future on all the podcast platforms. It was a delight to have, both of you. Dimitri Fontaine and Vik Fearing join us as guests today. And, this podcast for those who are here live will be published online on all the platforms.
Usually in about 48 hours-ish. So sometime on Friday, depending on your time zone, of course. So, thank you. Thank you both very much. Really appreciate the conversation today.
VIK: Yes, Thank you for having me.
CLAIRE: Thank you to everybody who is here on the chat. And I also want to just highlight that the next episode is going to be recorded live on Wednesday, December 6th at 10 AM PST because we will have changed time zones here in the United States by then the guests and topics are going to be announced soon and you can mark your calendar with a link will drop into the chat: https://aka.ms/PathToCitusCon-Ep10-cal
PINO: and before we close, we want to ask you a favor, especially if you've enjoyed the podcast.
Please, please rate and review us on your favorite podcast platform. It helps other folks find this new show. And a big thank you to everyone who joined the recording live and participated in live text chat on Discord.