Subtitles section Play video
DAVID MALAN: We've seen how with languages like Python
can we implement business logic on a server and, even if we want,
generate web pages.
And we've seen, on the browser side, the so-called client side,
how you can render information or data to users.
And with CSS can we style it, and with JavaScript can
we even make it interactive.
But when building a business or any application that
is interacting with users, where is all that data being stored?
Well, if you're running your own small business,
you might simply be using a spreadsheet.
Of course, spreadsheets, whether they're implemented
as Microsoft Excel, or Google Spreadsheets, or Apple Numbers,
generally stores data in rows and columns.
And if you're like me, you probably use that first row
to represent the names of the columns and the types of the data
that you're storing.
And as you continue to add more and more data to that spreadsheet,
you probably, like me, continue to add row after row after row.
And meanwhile, if you've got so much data or so many different types of data
that it doesn't really belong in one sheet,
you might actually create a second sheet or a tab
along the bottom of the screen, for instance, in which you
can store additional data as well.
So indeed, Microsoft Excel allows you to store data relationally, so to speak.
You might have one sheet with customer info, another sheet with invoices,
another sheet with products, or any number of other types of data.
And you can relate those to another by somehow having some commonality
among them, some identifier, some name, some customer ID or the like.
And so data that tends to be stored in rows and columns
can have these relationships.
And you can do this certainly in the cloud these days
as well with Google Docs, simply a web-based version of the same.
But when designing your spreadsheets, or more
generally, your data's schema, the design thereof,
there's a number of questions you need to consider.
What data do you want to store?
And where do you want to store it?
And what type of data is it, in fact?
Because whether using Excel, or Numbers, or Google Spreadsheets,
odds are you sometimes configure the columns to be displayed or formatted
in some ways.
Numbers might be displayed with or without commas.
Numbers might be displayed with dollar signs or some other symbol.
You might separate one field from another
using special symbols or punctuation.
So these are all configuration options in a spreadsheet that might
be among the first decisions you make.
In fact, let's consider a sample spreadsheet
wherein I might want to represent users in some web application.
All of us have, of course, registered for websites or applications
these days.
And what are some of the questions you're asked?
Well, you might be asked for your user name by some application.
You might be asked for your actual name.
What else might you be asked for?
Perhaps your email address.
And perhaps if you're buying something, or registering
for something, or someone that needs to know a bit more about you,
perhaps they'll even ask you for your address, maybe your phone number,
or perhaps even your age.
So you can certainly imagine there being even other fields that you need.
But let's consider now what types of data each of these fields is.
Now, what should a user name be?
Well, by definition, this should be a unique value that belongs to you
and only you on a system.
For instance, if I'm registering for a website
and that website is storing some data in, for instance, a spreadsheet,
I might ideally try to claim Malan if it's available.
My name meanwhile, of course, will be just
David Malan, which someone else can certainly have as well.
Hopefully my email address is only mine.
And that, too, will be stored in a column of its own.
And then my address, for instance, here on campus, 33 Oxford Street, Cambridge,
Massachusetts, 02138.
A phone number by which I can be reached, 617-495-5000.
And then my age shall be--
well, we'll leave that blank just for now.
So notice how each of these pieces of data is just a little bit different.
Some of them seem to be numeric.
Some of them seem to be alphabetic.
Some of them even have some punctuation therein.
Now, you might or might not in a spreadsheet
care to display this data a bit differently.
Indeed, the only number that I'm sweeping under the rug
here is, in fact, my age.
But hopefully I don't need one or more commas in that value.
So there isn't really any need for special formatting here.
But it turns out that when you're actually
building a website or software-based application,
you're probably not storing your data ultimately in just spreadsheets.
You might graduate eventually from being a small shop to needing more than just
Microsoft Excel, or Numbers, or even something
cloud-based like Google Spreadsheets.
Why?
Well, you have more rows than those programs can generally handle.
In fact, on your own Mac or PC, odds are,
when opening up big files, whether a spreadsheet or any other document,
sometimes you might actually feel that.
The computer might start to slow, and you
might start to see a spinning beach ball or hourglass
because you start to hit the limits of what
a local computer can do with just client-side software,
like a spreadsheet tool.
And so eventually you might actually need to use a proper database.
And a database is really just a piece of software
that can absolutely run on your Mac or PC.
But very commonly it runs on a server or somewhere else
in the cloud to which your own software connects.
For instance, if you're building a website or an application
in a language like Python, you can, in Python,
write code that talks or communicates with that database to pull data down
and to send data back up.
But spreadsheets are wonderfully straightforward.
It's just so intuitive how you store the data in rows and columns
and just more and more rows as you have more and more data.
And so what's nice about some databases is that they mimic exactly this design.
There exists in the world what are called relational databases.
And indeed, this is among the most common ways to store data relationally.
The jargon is a bit different in the world of databases.
You no longer call these things spreadsheets.
We call them databases.
And we don't call the individual tabs sheets.
We call them tables.
But we continue to call the structure of the data therein
rows for each additional piece of data and columns
for the different types of data that we have.
But in a relational database, such as Oracle, Microsoft Access, SQL Server,
My SQL, Postgres, or something smaller and lighter weight called SQLlite,
the burden is on you, the designer of the database,
or the programmer, or the business person
to actually decide on what types of data you are going to store.
Because the database, in order to handle more and more data
than a typical spreadsheet can support, needs a bit of help from you,
needs to know what type of data you're storing so that it can search it
more efficiently and sort it more effectively
and make it easier for you ultimately to add and remove data
from that particular database.
So in other words, you can do all of the same operations.
But generally, in a database, you have so much more data,
you're going to need to have the database help you help yourself.
So what does that actually mean here?
Well, let's consider some of these fields here.
Were we to migrate my spreadsheet with many, many, many rows of users,
for instance, to a proper database relationally,
I'm going to have to make a few more decisions as well.
And it turns out that I need to choose various types for these columns.
And generally, in a database, you only have a limited menu of options.
So let's take a look at what those might be.
Here in many relational databases are just
some of the data types that are available to you;
an integer if you want to represent something like 1, 2, 3, or perhaps
even a negative, a real number, otherwise known as a floating point
value, for instance, in Python that actually
has a decimal point and perhaps some numbers thereafter,
numeric, which is more of a catch-all and might handle things like dates
and times that are numbers in some sense but have some more
formal structure to them, and then the more general text, when you just
have words or phrases, characters or whole paragraphs
or more that you might want to store as well in some column.
And then lastly, you might have the cutely named BLOB,
or binary large object, when you actually want to store zeros and ones,
that is to say binary data, like actual files in your database.
However, you needn't do this typically.
You can actually store files, of course, on a file system on your own hard drive
somewhere on the server, but that option exists for you.
Now, this happens to be the list of data types supported
by the simplest of relational databases, something called SQLite.
And as its name implies, it actually supports a very specific language
via which you or anyone can request data from the database and store
data in the database, or update it, or delete it.
And that language is called SQL, structured query language.
SQL is a language via which you can do exactly
that, retrieve data from a database, put data into a database, and so much more.
It's an alternative to the pointing and clicking with which you're probably
familiar with something like Microsoft Excel, or Google spreadsheet,
or Apple Numbers, where it's really you, the human,
doing all the work via a graphical user interface.
Now, to be fair, in all of this spreadsheet programs
are there functions or macros that you might
be able to write so that you can program in those environments
but with some limitations.
And indeed with SQL and with SQL databases, more generally,
that is databases that understand this language, can you
achieve far higher performance, can you store far more data,
and nonetheless get at it quickly.
So it's sort of the next evolution of what you might otherwise do only
within your own computer spreadsheet.
But many other databases, Oracle, SQL Server, Microsoft Access, MySQL,
Postgres and more, support more than these data
types that allow you to help the database help you even more.
Because the more the database knows about your data, the smarter decisions
it can make at a lower level below these abstractions
and storing that data so that, when you have a question,
it can answer it quickly.
And so how might you refine what we mean by integer?
Well, in some databases, you might want to tell it
that this is just a small integer in this column one
after the other, row by row.
Or maybe it's just an integer somewhere a little bigger
than that, or perhaps a big int, which means it can be even larger.
Now, these actually map to very well-defined values.
If you think back to how we considered data itself is stored in a computer,
it's ultimately with zeros and ones or bits.
And indeed, an integer, so to speak, generally takes up
just 32 bits or four bytes.
But, of course, that few bits actually translates,
if you do a bit of the math, to four billion possible values.
So if you were to specify that a column in your database
is of type integer, that means you could type in the number from zero
all the way on up to four billion, or if you want negative numbers,
from negative two billion roughly to positive two billion as well.
But beyond that three billion or four, you're out of luck
if you're only using 32 bits.
Now, that might seem huge.
And that is a good problem to have if I have four billion users, or sales,
or anything in my database.
But for certain large companies these days, having four billion
or more records of some sort, transactions or logs,
not at all uncommon.
And for those purposes does there exist big int,
a bigger data type that uses 64 bits or eight bytes via which
you can count much, much, much higher, so high
that I'm not even sure how to pronounce that number.
But indeed, it should be large enough for almost all of our purposes.
So ultimately, it's not enough to say that the field in a database
is an integer but to specify just how big maximally that integer can be.
Now, fortunately, I think for my age we could probably get away with small int.
But indeed, with a small int, you typically have 16 bits.
So even then could your user be as old as 65,535 years old.
Now, what about real numbers?
These are floating point values with decimal points.
And here, too, you have fewer choices, typically, but along the same lines.
A real number would typically take 32 bits,
which allows you only some amount of precision, some number of digits
after that decimal point.
If you want to go even further and be ever more precise
can you use twice as much space, 64 bits or eight
bytes, and use double precision instead.
Well, it would seem that we sort of start with an OK
value and an even better value here.
But what's the trade-off, as there always is?
Well, it's fine if want to be able to count higher,
whether with integers or real numbers.
But you are literally going to pay a price.
Because if you want to count higher than four billion
in the context of integers, you're going to have
to spend twice as much space from four bytes to eight
in order to store those values.
And if you're never actually going to use values larger than four billion,
you're just spending twice as much space as you need to.
And if you don't really need to store floating point
values to terribly many digits of precision, you're just wasting space.
Now, for small databases, this might be inconsequential.
But for the Googles, and Microsofts, and others of the world
to deal in gigabytes and terabytes of data, all of these bits and bytes
will surely add up.
As for numeric, more of a catch-all numerically, so to speak,
whereby you have Booleans, zeros or ones, or false and true values,
dates which comes in a prescribed format, typically
year, year, year, year, dash, month, month, dash, day, day.
Indeed, even though across the world there
are different formats and different communities for representing dates,
in SQL do you specify them exactly in that format only.
Date time adds to that a space followed by hour, hour, colon, minute, minute,
colon, second, second, whereby if you want to store both date and a time,
you should store it in exactly that format in your database, even though,
thanks to code, whether Python or SQL itself,
you want to display that date in a geographically
localized and conventional way, you nonetheless
store it in this way in your database.
Now, sometimes you need to fight against the imprecision that's
inherent as a risk with real numbers.
And so you can use the so-called numeric data type
to specify a so-called scale and precision.
That is two numbers that specify really how many digits
you ultimately want to support.
So if you want to store things like dollar amounts,
maybe to cents or hundreds of cents, two or four decimal
places, can you specify exactly that, and you
will get exactly that much precision.
As for time and timestamp, well, time is quite simply hour, hour, colon, minute,
minute, colon, second, second.
But timestamp's a little more interesting.
It, too, can be used to represent time.
But as with many languages, Python and JavaScript among them,
it does so typically by remembering or counting
some number of milliseconds or seconds from a particular point in time.
In fact, some years ago, it was conventional to simply start
counting time from January 1st, 1970, the so-called epic as opposed
to say 0 BC or AD.
Unfortunately, it turns out that you generally represent these timestamps
with four bytes or 32 bits, which gives you access to four billion
or so possible values.
And it turns out that in the year 2038, we
will have run out of bits with which to represent time.
So if you recall the so-called Y2K problem when we ran into this issue
around the year 2000, so have we rather painted ourselves
into a corner in many systems by not allocating enough space
to get us past that year.
Well, what's the solution invariably going to be?
Well, use more space and, thus, cost.
But that is inherent in many systems.
And it's one of these things that us humans will have to address.
How about text?
Well, text can, too, come in multiple forms.
Whether it's English or any other language,
you might have different lengths thereof.
And so within the world of text values and databases,
need you specify typically how many characters
you want to store in that column.
And this is distinct from something like a spreadsheet program
where you can probably keep typing and typing or pasting as many characters
as you'd like.
A database, in so far as its purpose in life is to give you better performance
and more scalability than a spreadsheet allows,
wants you to tell it a bit more.
It's going to ask you to decide whether you
want to store a fixed number of chars.
So char followed by a number here represented at its end
means how many characters exactly shall you store in every cell in this column.
Now, why might that be applicable?
Well, in the US, for instance, for states,
we have a standardization of two characters
for every state in the country.
And so you might say char (2), thereby allowing
you to store CT for Connecticut, and FL for Florida,
and MA for Massachusetts and so on, because you know that every value is
going to be that finite length, two.
But sometimes you don't know the max length.
And indeed, in my own table of users, my own name might have D-A-V-I-D
and a space and M-A-L-A-N, so 11 for me.
But your name might be shorter or longer.
Or some user that hasn't even registered for our website
yet might have an even longer name than that.
So varchar, or variable number of chars, exists, too,
where you can specify not a fixed number of characters
to use but a maximum upper bound, perhaps something like 30,
perhaps something like 300, or some number of characters
that you decide that's going to be the upper limit of any human I actually
see in my sight who might have a name as many as that characters.
But varchar is smart and the databases that use it,
because it will use that many characters maximally.
But it's not going to waste space.
In fact, if you're using 30 characters for every name in your table,
well, it's only going to store as many characters are
necessary, perhaps plus a bit of overhead or accounting
to keep track of that value.
But that way you save space but still can handle large text or large strings.
Lastly, if you have particularly large text, whole paragraphs, or essays,
or documents that someone might paste in,
or the contents of an entire web page that you want to search,
well, you can have a text field more generally,
which tends to support tens of thousands or more characters in total.
But it does so in a way that's not quite as efficient.
Rather than keep it right in the column itself, so to speak,
it puts it somewhere else on the server, the result of which
is that you can fit more room over here, so to speak.
But it takes more time and effort to go search or find that data, so,
again, a trade-off of space and time.
But here, too, as with integers and real,
you seem to have multiple choices, one of which at first glance is better.
After all, why use char ever if you could also just say varchar
with that same amount?
For instance, for those two character codes for states,
why not just say varchar and not just char (2)?
Well, it turns out that if you promise the database
that every cell in your column will use exactly the same number of bytes,
it turns out that you'll have very straight or justified edges,
in some sense, conceptually for that column.
For instance, if every cell in a column takes up
exactly the same amount of space, then you might know,
if you start numbering the bytes that are represented by those cells,
this might be byte zero up here and then two characters away
will be address two, and then four, and then six, and then eight.
And if you think back to how algorithms like, say, binary search
are implemented, as soon as you have the ability to know arithmetically
where the next value, or the last value, or, best yet, the middle value
is, numerically can you jump right to it in so-called constant time,
thereby enabling something like binary search and ultimately logarithmic time.
So this is a powerful thing.
If you commit to the database that all of your values
will take up the exact amount of space, the database
can search that value faster for you.
Varchar, unfortunately, is rather akin to having only a left justified column,
whereby you might have data that looks very straight on one edge
with all of the characters left aligned.
But because the length of those characters vary,
this cell might be this wide, and this one this wide, and this one this wide,
and this one this wide.
And as such, because you have a very ragged edge, so to speak,
you can't just do simple math and add two and add two
to get to the next value in that column.
So in the worst case, if a column is only a varchar,
the best the database can do by default is just linear search,
or big O of n, which, recall, was notably slower than something
like log N or logarithmic time.
So the more help you can provide to the database the better off you are.
But of course you don't want to air too far on the other side
and say, oh, well, I'm going to have char 300 for every cell
in my Names column, because then you're using 300 characters by definition
for just David Malan who needs just, say, 11.
So lastly is there this thing BLOB, binary large object,
which can only store binary data, which can be used for images or files.
But those, again, tend to be best stored somewhere else on the system.
So with these possibilities in mind, how might we
go about designing the schema for this data?
We already have the names of my columns.
And I've already gone ahead and put in one sample row.
So let's consider in the context of all of those SQL types
which ones apply here.
Well, for user name, something like Malan,
it's probably not numeric because that's not
something I'd be so inclined to type.
But it probably is text.
But within text we have the choice of char or varchar
or larger blocks of text, more generally.
So what should this be?
Well, this one depends.
A user name tends to be fairly short.
Because after all, we humans probably don't want to type very long strings
just to log into some site.
Historically, maximal values of eight characters was common.
But that tends to be fairly constrained.
And you can no longer express very much of your name
if you tend to have a long name yourself.
So here we might say something like char, because we know it will be small.
And we'd like to be able to search on this field efficiently,
especially to log someone in fast, but probably not quite as
short as eight, so maybe 16 or 20.
Or if it's hard to guess there and you want more dynamism,
maybe you would say varchar of something like 255.
Why that value?
Well, recall that with eight bits can you count 256 possible values.
But if you start counting at zero, you can only go as high as 255.
And historically have many databases had limits of, say,
255 for a varchar, though that's no longer the case.
But you'll still see this very commonly.
So what's the best answer?
I don't know.
It depends on your data or someone else's.
For here, I'll go ahead and leave it as varchar (255)
just to give myself flexibility.
But a very good argument could be made for char of some smaller value
or even varchar of some smaller value, too.
As for name as well, I'm not really sure what this one should be.
I don't know how many characters your own name has.
Something like 11 is probably too few, even
though it satisfies my name just fine.
30 feels a bit [? type. ?] And frankly, I
bet if we Google longest name in world, something
tells me there's someone out there with quite a few characters, hopefully
no more than, say, 255.
But there, too, we might want to do some due diligence.
With email, too, this seems to be easy.
This, too, is just characters, even though you could certainly
have numbers.
I don't know how long the maximum email address will be.
But frankly, it's probably going to be variable in length,
so I'm going to go with a default.
Why 255?
Again, it's probably way more than we need.
But varchar is smart.
It's not going to waste that many bytes.
It's just a reasonable upper bound in the absence of better intuition.
At least we're being consistent.
Now, address, that might be decently long,
especially if it's going on envelope on, say, multiple lines.
But here, too, this is probably just text, and so I'll go ahead here
and say this, too, 255.
Phone number.
Finally, a number by name.
Well, what should a phone numbers field be?
Well, we had integer, or real, or, more specifically, small int or integer,
or big int.
But the funny thing here is even though a phone number is indeed
called by us humans a number, it's really
just a symbolic representation of a unique identifier for someone's phone.
And so these hyphens might be common in one community or culture.
Maybe you'd have parentheses or even pluses for country codes.
So frankly, very quickly does even a phone number not become so much
a number but a textual string.
So here I have some discretion.
And maybe I could be presumptuous and assume
I'm only going to have customers or users for now,
say, in the US for whatever business constraints.
And so I might say, you know what?
This is a candidate to actually do something like char, say, 10,
three-digit area code, a three-digit exchange, and then four digits
thereafter.
But that doesn't leave room for those hyphens, so I could make a char 12.
Or frankly, if they're just going to be there all the time,
why don't I leave them as char 10 and just
get rid of those values in my data?
Or alternatively, I could support parentheses or pluses as well.
It really depends on how I want to store the data.
But I like the idea of a textual type, something like char or maybe varchar
as opposed to an integer.
Because at least if I've called certain communities
or out from some businesses, sometimes you have to type unusual numbers.
At least in other countries, for instance,
if we generalize beyond this data set here do you
type zero to actually connect to someone local.
And the problem with zero is that, mathematically, it's
meaningless to start a value with zero.
And unfortunately, a computer takes that to heart.
And if you were to store phone as an integer
but a number were in some community to start with a zero,
your database would probably just get rid of it.
In fact, try that yourself in Microsoft Excel, or Apple Numbers,
or Google Spreadsheets.
And depending on how you formatted the column, you can type as many zeros
as you want followed by other digits.
And odds are, when you hit Enter, those zeros are gone.
As for age, here we perhaps finally have a compelling candidate for a number.
Small int should probably get the job done.
Integer would work as well, or big int, but increasingly wasteful.
But you know what?
Even here it's not that obvious.
I probably shouldn't even use an integer type here at all.
Why?
Well, I might indeed ask a human for his or her age upon registering.
The catch is that age might change the next day or the day
after, or the day after, because, of course, time is advancing.
And unless I also stored the date and ideally
time at which the user registered, I can't really
even do any math to figure out, oh, you registered a year ago.
Let me assume your age is that plus one.
So what would have been better from the get-go than age?
Probably just something like date of birth, DOB.
And, of course, in SQL do we have the ability
to store dates, even date times.
So here we probably have a better candidate
for exactly one of those numeric types, so to speak.
Now, we've only just begun to scratch the surface of available data types
and data we might want to store.
In fact, our spreadsheet or database could contain even more types.
But now let's just suppose that we're happy with our types.
And the goal now at hand is to start searching our data
and storing that data in an actual database.
I've been using here Google Spreadsheets just to lay out my schema
sort of as a worksheet, if you will.
But now let's graduate to actual SQL syntax
and suppose that this same data is stored
not in Google Spreadsheets or any other but in some database
elsewhere, a piece of software running somewhere else on my own computer
or somewhere in the cloud that knows how to organize this data, still
in rows and columns but in such a way that I can use this new language, SQL
or SQL, in order to access my data.
So what are the fundamental operations that
SQL supports, or a relational database, more generally?
It turns out that throughout computer science is there
a pattern of operations that many different systems support.
In the world of databases, you have fairly crassly
what's called CRUD, the ability to create data, read data, update,
and delete.
But more specifically, in the context of SQL, this new language, our last,
you have the ability to create data, select data, a.k.a.
READ data, update or insert data, a.k.a.
UPDATE, or delete or drop data, a.k.a.
DELETE.
So whereas in SQL you have these very specific terms,
they are just representative of a class of operations
that you might see throughout computer science.
So how do we go about using SQL?
Well, short of buying and installing something like Oracle
or downloading something free like MySQL,
we can simply use something that's indeed, by definition,
lighter weight called SQLite.
SQLite is an implementation of SQL, this database language,
via which you can install it on your own Mac or PC,
and it uses not special software or servers but rather just a file
on your own Mac or PC that stores all of those tables
and, in turn, rows and columns.
It thereafter creates an abstraction of sorts
as though you have a full-fledged server running
on your own machine with which you can communicate using SQL.
Now, any number of programs can be used to actually talk to a SQLite database.
You might use a purely textual interface, so-called command line
interface, using only your keyboard.
Or you might use something graphical, a GUI, a graphical user interface.
In fact, one of the easiest programs to use
is this one here, DB Browser, freely available for Macs,
for PCS running Windows, or Linux, or any number of operating systems
as well.
In advance, I've gone ahead and opened up a database in a file whose name ends
in .sqllite or also commonly .db.
This then is, again, just a file, a binary
file filled with zeros and ones that collectively
represent patterns of rows and columns and the tables that contain them.
And if I open this program, I see all of the tables or, if you will,
sheets from our old spreadsheet world.
This happens to be a database all about music.
And indeed, I have a table about musical albums and artists, customers
and employees, and genres, and invoices, invoice lines that
represent actual songs bought, the types of media involved,
and playlists and playlists tracks via which users can actually
customize those songs, and then lastly, the tracks or the songs themselves.
In other words, someone else on the internet
has gone to the trouble of aggregating all of this data
about music and their authors and organized it into an SQL database.
Now, how?
Well, if you look to the right here, you'll
see a number of esoteric commands all involving CREATE TABLE.
And indeed, CREATE is one of the four fundamental operations
that SQL supports.
And without going into the weeds of how these tables were created,
they were created in such a way that the author had to decide on every table's
columns' names as well as their types.
And that part is interesting.
For instance, let me go into the Albums table here and expand this one now.
Here I have Album ID, and Title, and, curiously, Artist ID.
But you'll notice in the Album table we have no mention of artist's name.
In fact, if we follow the white rabbit here,
we see in the Artist table that, oh, in Artist do
we have an artist ID and name.
And indeed, if we keep digging, we'll see that in each of these tables
there's not quite as much information as I might like.
In fact, IDs, whatever those are seem to be more prevalent.
And that is because, in the world of databases,
and perhaps even some of your own spreadsheets,
it's conventional and daresay best practice to normalize your data.
You indeed should try to avoid having redundancies within tables and even
across tables.
Now, what does that mean?
Well, consider our own spreadsheet of users
that we're trying to get into a database.
I had a column there called Address.
And I proposed that my own address was 33 Oxford Street in Cambridge,
Massachusetts 02138.
Now, it turns out that there's quite a few other people at my workplace,
and so they might have that exact same address as well.
And in fact, that address doesn't really lend itself
to very nice printing on envelopes because all I did
was separate things with a comma, not actually hit Enter.
And in fact, because everything was all in one column,
it would seem a little difficult to search by, say, zip code.
Find for me all of the users from Cambridge, Massachusetts in 02138.
Well, you could do it sort of searching free form
all of the values in some column.
But it wouldn't be particularly clean or efficient because you'd
have to look at and then ignore all of those other words and numbers, 33
Oxford Street, Cambridge mass, if you only care about that zip.
So before we forge ahead, let's see if we
can't infer why this database is seemingly, at first glance,
more complicated than you might like.
Well, if I go back into my spreadsheet here, what really should
I have done with Address?
I probably should have quantized that value
into streets and then city, then state and zip code,
not having just one column but several.
In fact, here, what I've gone ahead and done is separate out Address
into Street and City and State and Zip.
And for each of those have I specified a very precise type.
I've gone ahead and proposed that Street is still varchar (255), as is city,
because I don't really know an upper bound, so we'll at least be consistent.
For state, I've hopefully been smart in at least assuming users are in the US.
I've said char (2) for just that two-character code.
And for Zip, too, I'm preemptively trying
to avoid a mistake with even Cambridge whose zip codes start with a zero.
Where I had to specify again that that's just an integer,
I might actually lose mathematically that first digit.
But by storing it as a char with five characters and no hyphen or four others
can I ensure that that 02138 remains exactly that.
But here, too, we have a bit of inefficiency.
Imagine if more and more users from my own building
register for this particular application.
You might have again and again these exact same values
if all of my colleagues are in the same building.
And here, as with programming, as soon as you
start to see a lot of redundancy, the same data again and again,
there's surely an opportunity for better design.
This is correct.
My colleagues might also live here if we added
their names and their distinct emails.
But they don't necessarily need to have the same exact text stored
again and again.
So what might you do in this case even in the world of spreadsheets?
Well, on the sheet, I might just rename this actually more explicitly to users.
And you know what?
Let me go ahead and create another sheet in my spreadsheet world
and call this, say, Cities.
And my cities might actually have a city name and perhaps a state and a zip.
But in this leftmost column, I could be smart
and start to assign each of these cities some unique ID or identifier.
And so here might I have just a unique identifier, typically an integer.
City might again be varchar (255).
And State might, again, be char (2).
Zip Code, meanwhile, can stay as char (5).
But now what I can do is presume this, that if Cambridge, Massachusetts,
02138 is in this sheet or, if you will, table let's arbitrarily but reasonably
give it a unique identifier as one.
And if I were to add more cities here, like Allston, Massachusetts and its zip
code, I could sign it a unique ID of two.
Because now that I have this data, here's
where you get that relational aspect.
I can relate this sheet or table to my other as follows.
Back in this table now can I go ahead and delete much of this redundancy
here and actually go ahead and say city is not really a city but it's a city
ID.
And here now it can be a proper integer, because the city
in which all of my colleagues live is the exact same one as mine.
Now, here there's still an opportunity for improvement, to be fair.
Because if all of us are in the same building,
maybe that should be factored out as well.
And if I really wanted to go down this rabbit hole,
I could add another sheet or table called,
say, Buildings and factor out also that 33 Oxford Street, give it a unique ID,
and only store numbers.
So in short, the more redundancy and the more waste
that you end up having in your data, the more opportunities there
are to, so to speak, normalize it.
To factor out those commonalities and create
relations between some pieces of data and others
and the convention in computing is to do, quite simply, numbers.
Why?
Well, it turns out it's a lot more efficient to just relate some data
to others by relying on small integers.
Four bytes is not all that much.
And in fact, inside of a computer CPU are
small registers, tiny pieces of memory that
can be used on the most basic of operations;
additions, subtractions, and comparisons for equality.
And so with small values like integers can you
very quickly reassemble or relate some data to others.
And so here we have a general principle of database design
to normalize it by factoring things out.
And so if we go back into our musical database,
you can perhaps infer why the author of this data did that preemptively.
They have their albums having album IDs, a number like 1, 2, and 3, a title,
which is the actual title of that album.
And then to associate those albums with artists,
they've used not the artist's name but an ID.
In this way can an album have the same artist as another
without storing that artist's name twice.
Moreover, if the artist happens to change his or her name,
as is not uncommon in the musical world, you
can change that name in just one place and not
have to scour your tables for multiple copies.
And so if we continue this logic, we'll see in more and more tables
that we have this principle of naming the data but then assigning it an ID.
And if you want to relate some of that data to another,
you simply store the ID, not the actual values.
Of course, this is decreasingly useful as we go, because now some of my data
is in this table, and that, and this other table, and here.
And so while very academically clean and refined, it doesn't anymore
seem useful to the data scientist in front of his or her computer
that just wants to answer questions about data.
And yet it's all over the place, whereas before, with Excel,
you could just pull up a window.
But that's where SQL itself comes in.
SQL does not just prescribe how to type your data
but rather how to query it as well.
And in this particular program here, DB Browser,
I can actually go over to this tab here for Execute SQL.
And I can begin to execute those actual commands, SELECT,
and CREATE, and UPDATE, DELETE, and others and actually see the results.
What SQL allows you to do is express yourself programmatically
using a fairly small language, albeit new, that
allows you to create temporary tables, just the results, just the result
sets, so to speak, that you want, only those rows that you care about.
So for instance, if I want to see all of the albums in this database,
well, in the world of spreadsheets, I would just double click in and peruse.
But in the world of SQL, I'm actually going to type a command.
I'm going to go ahead here and say SELECT star, for give me everything,
from the table called Album, and then semicolon to finish my thought.
I'm going to go ahead and click the graphical Play button here
to execute this command.
And you'll see suddenly that here are all of the albums
apparently in this table, 347 of them in total in this particular database.
And notice that all of the titles are in one column,
the Album ID is to the left, and the Artist ID, of course, to the right.
Well, if you're now curious who is the artist behind the album called
For Those About to Rock, We Salute You, well, I
can just make a mental note that the artist ID is one.
And you know what?
With SQL, it turns out you can use predicates.
You can qualify the data you want.
You don't have to say, give me all.
You can say, give me this.
So how do I do that?
Well, I can actually say SELECT star from Artist, the other table, where--
and here comes my predicate-- artist ID equals one, a semicolon again to finish
that thought, hit Play, and voila.
It turns out it's AC/DC, the artist behind that particular album.
Of course, this felt a bit manual.
And this seems no better than a spreadsheet wherein
now more of the work seems to be on me.
But SQL's more expressive than this.
Not only can you create, and select, an update, and delete data,
you can also join data from one table and another.
So in fact, let me go ahead and do this a little more explicitly.
I want to go ahead and select everything from that original album table,
but I'd like to first join it with the artist table.
How?
Well, I want to join those two tables, this one and this one,
kind of like this, conceptually, so to speak.
On what fields?
Well, in Album, I recall there's an artist ID.
I want that to be equal to artist.artistid.
In other words, if you imagine this hand to be albums and this hand to be artist
and the tips of my fingers each represent those artist IDs,
we essentially want the SQL to line up my fingers
so that I have on the left my albums and I have on the right the artist.
And every one of my rows now has both.
Let me finish my thought with a semicolon here and hit Play.
And voila, now we see even more information but all together.
We see that album number one, For Those About to Rock, We Salute You,
has an artist ID of one and clearly an artist ID of one but a name.
Well, what's happened?
Well, both of these tables have kind of been concatenated together but joined
intelligently such that the artist IDs in both tables
line up on the left and the right.
Of course, at this point, I really don't care about all these numbers.
And I definitely don't need the temporary duplication of data,
so I don't have to just keep saying star,
which is the so-called wild card, which means give me everything.
I can actually instead just say, give me the title of albums
and the names of the artists by specifying with commas
the names of the columns that I actually want.
And if I now click Play, I get much simpler results,
just the titles of albums and just the names of those artists.
Now, how else can we express ourselves with SQL?
Well, there are other keywords besides WHERE and besides JOIN.
You can also group values by something and specify that they must all
have something as well.
For instance, let me go back to my data here and consider
which artists have multiple albums.
Well, if we consider the results that we had earlier do
we have AC/DC as the artist behind For Those About to Rock, We Salute You,
but also behind Let There Be Rock.
Moreover, this band Accept has multiple albums as well.
And if we scrolled further, we'd probably see others.
So if we'd like to see those bands, those artists that have
multiple albums, how can we do this?
Well, what if I somehow collapsed all mentions of AC/DC into just one row,
and next to that row I put an actual count?
Well, I could refine this query as follows.
I can say, yes, join these tables together, but then
collapse them, if you will, based on grouping some common value.
Let's go ahead now and group the data by name
so that any artist that appears multiple times
will effectively be collapsed into one.
But I'd like to remember how many rows got collapsed into just one.
And so rather than select the albums themselves this time,
I'm going to select the album's name and then the count there of,
thereby specifying show me the name and show me the count of that name
before we grouped by.
If I go ahead now and finish my thought and click Execute,
I'll see that, indeed, AC/DC had two names and Aaron Goldberg had one.
And if we keep scrolling, we'd see all of the bands' names
that had one or more albums and the count for each of those.
If I want to filter out now maybe those bands that
only had one hit album in this database, I can instead say,
you know what, go ahead and group by the group's name,
but then show me only those bands having more than one album.
Well, here, too, can I simply filter my results saying literally quite
that, having some number count name greater than one, semicolon.
Hitting Play now and you see the results immediately eliminate all
of those bands that had just one album.
And now if I scroll through, we'll see all those bands
that had two or really more.
And so with SQL can you begin to express yourself
certainly more arcanely than you could with just
a click of the mouse in a spreadsheet but ever so much more powerfully.
But the key is to build these queries up piecemeal.
Indeed, this now already looks quite complicated.
But we began by just selecting all data and then refining, refining, refining,
thereby working at this level and only getting into the weeds
once we knew what we wanted.
Notice now just how fast these operations have been.
Indeed, at the bottom here do I see that 56 rows were returned.
How fast?
In one millisecond.
And indeed, even though it might take me longer
to describe in English what it is I want, in fact,
the computer can find this data so quickly.
But that's because we've done at least one thing already.
This data is already organized with these proper types.
And it also has a few other key characteristics as well.
When storing your data in a SQL database,
you're also asked to answer some questions.
For every table you're asked to specify effectively what,
if any, is this table's primary key.
These are key words in SQL that you can apply
to some column that says, hey, database, this column
is my primary value that uniquely identifies every row in this table.
In the context then of our user spreadsheet
with which we began this discussion, that identifier for City
was a primary key.
I might very well have used a city's name
as unique or perhaps even the zip code.
But far more efficient, especially if you
want to avoid ambiguities or duplication,
is to just use that integer.
And so here a primary key is almost always a numeric value, at least
in the most optimized layouts of data.
But it guarantees to the database that there will be
no duplicates on this particular value.
But more powerfully, you can define in one table a column to be a primary key,
but then in another table that same value to be a so-called foreign key.
In other words, throughout this example in the actual SQL database,
I had Albums in one table and Artists in others.
And that Artist table had an Artist ID column
that was within that table known as a primary key.
But when you saw that artist ID in the Albums table,
it was contextually there a foreign key.
Now, beyond semantics, this is an actual and valuable property.
It ensures that the database knows how to link
and how to link those two columns efficiently.
Moreover, you have even fancier features available to you when you declare keys.
You can also tell the database, you know what?
If I ever delete this artist, go ahead and delete all of that artist's albums
as well.
And you can configure a database automatically
to have this so-called cascade effect whereby data is updated
and your data is consistent at the end of the day based on those relations,
if you will.
Now, in columns of data can you also specify
that every value's got to be unique.
It doesn't necessarily need to be your primary key,
but it might still be unique.
Like what?
Well, in our Users table that we were creating on the fly,
an email address might, by human convention,
be unique, assuming I don't share it with someone else.
But using an email address, multiple characters, many possible characters,
tends not to be the most efficient way to search on data.
So even in my Users table might I have added for best practice
a numeric column as well, probably called ID, as my primary key.
But I might still specify when moving that data from my Google spreadsheet
into a SQL database that, you know what?
Please ensure that this Email column's unique
so that I or some other programmer doesn't accidentally
insert duplicate data into this table.
And moreover, the database then can search it
more efficiently because it knows how many, if any, there
are of any one value.
There's one and only one maximally.
Lastly, there's this keyword here, index.
Well, the other feature you get so powerfully from proper databases
is the ability to search and sort values efficiently.
But the database doesn't know a priori how to do either
on the data you care about.
Because only if you tell the database what data you plan to search on
and sort frequently can it help you in advance.
And so if when creating a database table you tell the database server, go ahead
and index this column, what it will do is use a database structure,
a tree structure not unlike our so-called binary search trees,
that pulls all the data up in an artist's rendition thereof,
thereby ensuring that it doesn't take as many steps
to find some email address or something else because you
have indexed that particular column.
It won't store it linearly top to bottom or left to right.
It will store it in a two-dimensional tree
structure of some sort, often known as a B-tree, that
allows you to grab the data in hopefully logarithmic and not linear time.
Well, turns out there are even more features you
get from actual databases like SQLite.
Well, you have the ability to specify when creating a table, please go ahead
and auto increment this column.
Well, what does that mean?
Well, I very manually a moment ago assigned
Cambridge the unique identifier of one.
But why should I, the programmer, even have
to worry or care about what the unique values of my inputs are?
I just need that that key exists.
I do not need to care about what that value is, just that it exists
and it's unique.
So you can tell the database on its own, please go ahead
and, any time I add a new row to this table,
increment that value automatically.
You can also specify to a database, please
ensure that no values in my database are null that is empty,
thereby ensuring that a bug in your code or some missing user input
doesn't accidentally put into your database
a row that's only sparsely filled with real data.
The database can help you with these things just as Python can as well,
but it's a final layer of a defense before your data.
And then functions as well.
SQL itself is a programming language.
It might not necessarily have as high ceiling as something like Python,
as much functionality.
But built into SQL are any number of functions.
If you want to select the average revenue
across any number of purchase orders, you can use the average function.
And in MySQL Query can I select data but pass it into one of these functions
and get back an answer without having to paste it into, say,
a spreadsheet, let alone calculator.
I can count rows just as I did.
I wanted to count the number of albums that a given artist had,
and COUNT was a function supported by SQL.
You can get maxes and mins.
You can get summations of value and so many more
features built into this language.
And while the tool you might use might not
be DB Browser, perhaps it's just a textual interface or even something
even more graphical, it ultimately is just
executing on your behalf the SQL queries and handing them off to the database
for execution.
Now, with all of these features that you get with the database,
it all sounds too good to be true.
You can scale, you can eliminate redundancy,
and you can still select all the data you want.
But unfortunately, you have to start to think harder
about the design of your system.
Because databases are sometimes vulnerable to mistakes, if you will.
Consider, for instance, something like Twitter
that tends to keep track of how many times something's retweeted.
Or consider an app like Instagram, which keeps
track of how many likes or upvotes or views some post has gotten.
On the most popular or viral of media, those counters,
those integers might be getting incremented ever so quickly.
If you and I both happen to view or like something at nearly the same time,
well, that interface from us into the system
might actually trigger some instruction on some server
somewhere to tell Instagram's database to increment some value.
But how does a database go about incrementing a value?
Well, if the value of views or the value of counts
is somehow stored in a database, a column of type integer,
and you go ahead and execute a SQL SELECT in order to get that value,
and, for instance, 100 people before me has liked some post,
well, the value of that result comes back as 100.
I then do some math in my code, perhaps Python.
I increment the 100 to 101, and then I use a SQL UPDATE, as you can,
to push the data back into the database.
But suppose both you and I anywhere in the world
both happen to like a post at the same or nearly the same time,
as can certainly happen when posts are especially popular.
Unfortunately, a computer can sometimes do multiple things
at once or at least in such rapid succession
that it appears to be at the same time, but a race of sorts
can happen, a race condition, if you will, as follows.
If both my button and your button is pressed
at nearly the same time and that induces execution of code on Instagram server
that selects for both of us the current count of views,
suppose that both of the threads, so to speak, both of the SQL operations
that select that data both come back with the value 100,
each of the blocks of code serving our requests
go ahead and increment that value to 101 and then, via SQL UPDATE,
pushes that value back to the database.
Unfortunately, because both you and I induced an evaluation of that math
at nearly the same time, what the database might end up storing
is not 102 but 101.
In other words, if two people's input is triggering a race to update data,
the database had better keep track of who and when asked for that update.
Otherwise, you lose data.
Now, in the case of tweets or likes, it's probably not all that problematic.
Though, frankly, that is their business.
But you can certainly imagine that with banks or financial institutions, where
the numbers matter ever so more, you certainly
don't want to accidentally lose track of some dollars.
And so how do we go about solving this in the case of a SQL database?
Well, it turns out that there is fairly fundamental primitives or solutions
you can use.
Consider a metaphor in the real world, such as, say, a familiar refrigerator.
And suppose that you and your significant other
happened to both like something to drink at the end of the day, like some milk.
And so you go ahead when you get home, and the other person's not,
and you open the fridge and you see, oh, darn it, we're out of milk.
And so you close the fridge and you head downstairs
and you walk to the nearest store.
Meanwhile, that other person comes home and, wanting some milk,
opens the fridge, and darn it if we aren't out of milk as well.
And so that person now heads out, perhaps in a different car,
in a different route, and heads to the store, some other store to get milk.
Fast forward some amount of time and both of you
come home, and darn it if you don't now have twice as much milk as you need,
and it does go bad.
And so you've both ended up buying milk when really only one of you needed to.
And this is similar in spirit, but now you've
got more data than you actually wanted, but it's not the right amount of data.
So why did that happen?
Well, both of you, like Instagram, inspected the state of some value
and made a decision on it before the other person
was done acting on that information.
So in our very real world of milk, how could you
go about avoiding that conflict, that race, to restock the fridge?
Well, you could quite simply grab a pen and paper and leave a note,
so to speak, on the fridge telling someone else, gone for milk,
and hopefully they then would not do the same.
Or perhaps more dramatically you could lock the refrigerator in some sense
so that they can't even get into it and inspect that state.
But ultimately, you need your act of checking the fridge
and restocking it to be what we'll call atomic.
And databases can and hopefully do provide atomicity,
that property, the ability to do multiple things together or not at all
but not be interrupted by someone else's work.
And in fact, in the database world, these
are generally known as actual locks whereby you say to the database,
don't let anyone else write to this table or row
until I am ready to release or unlock it.
That, of course, though, tends to be a very heavy-handed solution.
Say don't let anyone else touch this data.
Better to do it on a more fine-grained control
so that you don't slow your whole system down.
And so SQL databases tend to support what are more finally known
as transactions whereby you can execute one or more commands again and again
and again back to back but make sure that all of them
go through it once before, say, your commands
that your user input induced actually is allowed to get executed.
Now, honestly, even in the most high-tech
of places like Instagram and Twitter, this is a hard problem.
Because at some point, even waiting for my operations
to complete before yours can go in can be a challenge
and a bottleneck for everyone else.
And so in the case of the most viral of posts, what can systems these days do?
Well, you could just kind of wait and write
that data back to the database that is updated eventually.
And indeed, another property of databases
is known as just that, eventual consistency, a property that says,
don't lose any data, but only eventually make sure it's reflected on the server.
Eventually get the value right, but do get it right.
And so what Instagram and Twitter and others might do
is just cache or buffer that data, waiting until things have quieted down
'til the post is no longer viral or most users have gone to sleep.
Now, that alone might not be to the best of solutions,
but it avoids having to get the highest powered and most expensive hardware.
Of course, in other contexts, that might be the better solution.
In the world of finance, sometimes it comes
down to the actual length of cables or distance from some server to another
to ensure that the data gets there so fast
that you don't run into these sorts of challenges.
So databases can solve this, but the developers and designers
that use those databases need to know how to do it and that they should.
Lastly, there's another challenge as well,
unfortunately all too commonly done these days
because folks just don't defend against it via common mechanisms.
It turns out that a bad actor somewhere on the internet or your own network
can potentially, if you're not careful, trick
a database into executing commands that you didn't intend.
For instance, suppose in the context of Python
you have some code that looks a bit like this.
Here is a program written in a mix of pseudocode
and Python that's designed to allow a user to input
the title of an album for which they want to search.
And so here I use the Python function INPUT to prompt the user for just that.
On the left-hand side do I clear a variable called Title, and then
assigned from right to left, the user's response to that variable.
Then suppose for the sake of discussion there is some function called EXECUTE
whose purpose in life is to take input that itself represents a SQL command.
That SQL command might be this, so like star from Artist
where Title equals something.
Now, what is that something?
Well, if I have the user's input in a variable called Title,
I can use the plus operator in Python, not to add but concatenate
two strings together, coding them singly and completing that thought.
The problem, though, with SQL is that user's not really to be trusted.
And whether the user's input is coming from a keyboard on a Mac,
or PC, or perhaps, more compellingly, from an app or website,
you probably should not trust all your users.
Because suppose that your user typed in not the album name for which they
want to search, Let There Be Rock, but rather they type something
like Let There Be Rock, semicolon, DELETE,
thereby using SQL's apparent DELETE command in order
to trick your database into executing not one but two
commands, a SELECT and DELETE.
And indeed, this is what's known as a SQL injection
attack, the ability for an adversary, a bad actor
out there, to somehow trick your database and your code
into executing some command that you didn't intend.
How is that possible?
Well, some of these characters are dangerous, so to speak.
A semicolon in SQL tends to separate one command from another.
It finishes your thought.
And if you yourself don't anticipate that some human, this bad actor,
might type in themselves a semicolon when they really
shouldn't be typing SQL at all, you might mistake that semicolon
for the actual terminus between one command and another.
And if you just blindly pass it into your server
and let it execute as usual, you might execute not just that SELECT
but that DELETE or anything else as well.
And in this way can an adversary not only delete data from your database
but maybe select more than you intended, or update or insert.
It's ultimately up to you to defend against these threats.
So, how?
Well, it turns out that there are libraries, code written by other people
that, frankly, via very easy-to-use functions,
just make it easy to sanitize or scrub, so to speak, user's input.
What do these libraries or these functions typically do?
Honestly, they just escape, so to speak, these dangerous characters.
Something like a semicolon or perhaps a single apostrophe that
might, in SQL, have some special and dangerous potential meaning,
they just escape them as by putting a backslash, a special character in front
of them so that if the human were to type in Let There Be Rock,
semicolon, DELETE, that would actually be interpreted safely by your database
as a search for an album called Let There Be Rock, semicolon, DELETE,
which of course most likely is not the name of an album.
So that query would probably not return or select any results.
But more importantly, it would not be tricked
into executing two SQL commands.
Rather, it would execute only the SELECT but with a nonsensical value.
Lastly, consider what a database is.
It's really a piece of software running on some computer
somewhere, be it on my own Mac, or PC, or some server in the cloud.
But if you have just one database, as I seem to keep presuming,
you have this so-called single point of failure,
again, just as we had in the world of cloud computing more generally.
And so with something like data where you don't want to lose it
and you certainly don't want all of your web servers or apps
to go offline just because one server, your database server, has gone out,
it's time to revisit the architecture or topology of systems more generally.
Something tells me that it's not going to be
sufficient to have just one database.
You probably want two.
But if you have two databases, now how do you decide where to put the data?
Do you put it here, or over here, or maybe in both places?
If you put it in both places, though, you're then using twice as much space,
so already we've opened a can of worms.
To solve one simple problem, don't be a single point of failure.
But that's going to cost you some time, or some money, or certainly space.
So what can you do if you're architecting
a system that has now not just web servers but, say, databases?
Well, odds are you're going to want to have not just the one, pictured here
as a cylinder, as this convention, but you're probably
going to want to have a second as well.
But of course, if you have two of them, odds
are it's not sufficient just to store half of your data on one
and half of your data on the other, because, of course,
you've not solved the single point of failure.
You now just have two single points of failure because half of your data
could be lost here or half of it here.
So odds are you're going to want to start having backups of data.
But you don't want to necessarily have to waste time restoring
from backup, especially if you want to maintain
as many as five nines of uptime.
So odds are you're going to want to have these databases really
be duplicates of one another.
And whenever you write data to one database,
you should probably write it to the other in parallel.
So, yes, admittedly, you have just spent twice as much space
and, frankly, twice as much money.
But at some point those sorts of costs are inevitable.
But there's other ways to scale here, too.
You can, of course, hit a ceiling on vertical scaling
even when it comes to databases.
After all, if a database is just a program running on some server
and there is only so much RAM or disk space or CPU in that server,
eventually you're not going to be able to store as much data
or as quickly as you want.
So what can you do?
Well, you could, for instance, shard your data
and have not just two but maybe four or more servers and put all of the users
whose names start from A to M on one half of your cluster of servers,
so to speak, but maybe everyone else from M
to Z based on, say, their last name can go on the others.
To shard a database means to split the data in some predictable way
that you can repeat again and again.
But even there, too, even if only the As through Ms are going to the left,
you want to make sure that you still have that backup or replica.
And this arrow suggests that they really should be intercommunicating,
not unlike load balancers we've seen.
But there's another way you can scale your databases as well.
You don't have to have databases doing both reading and writing.
To read data from a database or any server means to take it from its memory
and read it into yours.
And to write means to do the opposite, to save it.
Well, what you can do actually in the world of databases
is also replicate your databases multiple times.
And you might have connected to these two primary databases
multiple other databases that are just copies in one direction and not both.
And what you might then do is use these two primary databases not only to read
but to write, abbreviated here RW.
But these other databases down here, which
are just copies of the ones to which they're connected,
are just called read replicas.
They exist solely for the purpose to read from them again and again.
When might this make sense?
Well, in some contexts, like social media,
like Facebook, it's probably the case that there
are more reads than there are writes.
That is to say you probably know more people who post more content than you
but you probably still read or look at theirs.
And so if the data for your business follows
that pattern whereby writes are maybe common but reads are way more common,
you can do exactly this model and replicate again and again, honestly,
as a tree structure for efficiency so that it doesn't all
have to replicate one at a time.
But then you can write software, be it in Python or something else,
that writes data only to these one or two servers but reads
from any number of them as well.
But this, too, is a bit of a rabbit hole, because at some point
you want to have this redundancy not in one
location but others, east coast and west coast, one country and another.
And at that point, you might actually run into the limitations of time.
Because after all, it takes a non-zero number of milliseconds or seconds
for data to travel long distance.
Consider after all how long it might take
data to go from Cambridge, Massachusetts to somewhere in Japan.
That's far longer than it might take to just go down the road to MIT.
So here, too, we can revisit all of the problems we saw in the world of cloud
computing and servers more generally.
They're back to revisit in the context of databases.
But with databases, you care ever more that these things not go down,
or if they do, that you have spares, backups, or replicas.
Because now that we're storing our data in this centralized place,
we have to think hard not only about how we're scaling computationally
but how we're scaling in terms of our data as well.
So consider where then we began.
We started by laying out data in a spreadsheet, be it Microsoft Excel,
or Apple Numbers, or Google Spreadsheets.
From there we considered what types of data
we might store there so that if we want to upgrade,
so to speak, from a spreadsheet to database,
we know what types we can specify.
And in SQL, whether implemented in SQLite, Oracle, or MySQL,
or something else, they tend to be standard types that
tend to be common across platforms, albeit with some variations,
so that we can think hard about these types
and then ultimately help the database help us be performant.
Because if I know that I'm going to be searching or selecting based
on certain data, I can tell the database, for instance,
to make it unique or at least index it.
And then using SQL constructions like SELECT, and INSERT, and UPDATE,
and DELETE, and yet others can I manipulate that data and get
at it even faster, frankly, than the human me could with a mere spreadsheet.
But with the design of any system, as with databases,
we start to open new cans of worms and new problems
as we start to explore more sophisticated challenges.
But here, too, many, if not all, of these problems
can be solved by simply reducing the problems to first principles
and consider, what is the problem to be solved?
How is our data represented?
Where is it stored?
And consider ultimately what business constraints or legal constraints
we have when manipulating that data and consider what tools of the trade
are available to us.
This then is database design and, more generally, design unto itself,
not just focusing on the correctness of implementing solutions to problems
but the efficiency and the scalability as well thereof.