  • DAVID MALAN: Hello, world.

  • This is CS50 on Twitch.

  • My name is David Malan, and I'm here with CS50's own--

  • COLTON OGDEN: Colton Ogden.

  • Thanks for joining, everybody.

  • CS50 on Twitch.

  • DAVID MALAN: Indeed.

  • Really happy to be back here.

  • Thank you so much for the invitation.

  • COLTON OGDEN: The master jedi.

  • DAVID MALAN: Yeah, so I hear.

  • Nice to see everyone in the chat here already.

  • We were watching as everyone was saying hello to each other,

  • too, just a bit ago.

  • COLTON OGDEN: Let me pull up the Twitch chat you were just talking about.

  • I can see it over there.

  • Let me actually scroll up.

  • We have quite a few messages that we didn't quite

  • read off just yet, but a lot of people in the chat already.

  • DAVID MALAN: Yeah, Bhavik Knight-- he and I are now bffs on Facebook.

  • So nice to see you in the chat, again.

  • COLTON OGDEN: He deemed you the Master Yoda.

  • DAVID MALAN: Oh, I see.

  • Thank you.

  • COLTON OGDEN: MKloppenburg, looking forward to this one.

  • Bella Kirs, another-- oh, RealCuriousKiwi.

  • That's Brenda.

  • She's started joining us on--

  • DAVID MALAN: It is.

  • Brenda from New Zealand.

  • Yeah, Brenda what time is it there back home in New Zealand?

  • COLTON OGDEN: I reload the page and the entire chat disappeared.

  • Guess that's how Twitch works.

  • Yes, yeah, you'd have to go back to the actual video, scrub backwards,

  • and it'll replay the chat for you at that moment in time,

  • which is a feature or a bug, depending on how you look at it, I suppose.

  • Yeah, a bunch of people.

  • TZZEK is a new person, I think, with the-- what is that?

  • The dog sensor flex emoji.

  • I'm not actually sure which one that is.

  • Music's pretty loud.

  • We changed the audio a little bit.

  • It should sound a little bit better for our voices.

  • I realize now the music might have sounded a little bit louder

  • as a result of that.

  • But yeah, thanks everybody for joining us.

  • So what are we talking about today?

  • DAVID MALAN: So today Colton asked me here

  • to talk about regular expressions, which is a topic that in CS50,

  • the open courseware course on edX and here on campus,

  • that we don't really spend time on, even though there's definitely

  • some opportunities.

  • So I thought we would introduce them as a solution to problems

  • that you can perhaps tee up for us.

  • COLTON OGDEN: Yeah and you were my first intro to regular expressions, actually.

  • I think for some scripts you needed me to do stuff and--

  • DAVID MALAN: Look at that.

  • Coming full circle.

  • COLTON OGDEN: It is.

  • This is a thing called a regular expression, use it.

  • And I was like, I don't know what a regular expression is, but here we are.

  • DAVID MALAN: How about now?

  • COLTON OGDEN: To a degree.

  • Irene is here.

  • She says, hi, David and Colton.

  • And hi, everybody.

  • DAVID MALAN: Nice to see you, Irene, and Brenda 9:00 AM in New Zealand

  • tomorrow, apparently.

  • So you're a day ahead of us.

  • COLTON OGDEN: Oh, and Minter27, finally, I caught you from the start.

  • Alex Gabrielov, 'sup from Russia.

  • DAVID MALAN: Wow, nice to see you, Alex.

  • All the way here from Cambridge, Massachusetts.

  • COLTON OGDEN: People all over the place.

  • And of course Andre.

  • Andre was actually here when I was testing audio.

  • DAVID MALAN: Andre keeps asking the hard questions.

  • So we're going to take questions maybe at the end, Andre.

  • COLTON OGDEN: If you want, I can bring you

  • in, if you want to-- maybe we can start.

  • People feel free to definitely contribute questions,

  • but we don't really have to say as much.

  • People have lots of questions, typically.

  • All programmers and Bhavik Knight.

  • But we'll get started, and if people have questions,

  • maybe they can contribute.

  • DAVID MALAN: Yeah, absolutely.

  • COLTON OGDEN: Oh by the way, new feature.

  • We have the people that now follow us.

  • DAVID MALAN: Nice.

  • COLTON OGDEN: Something we introduced yesterday.

  • The chat is going to have this grey theme,

  • but the follower theme is kind of in the middle and a little bit more

  • transparent.

  • But buimik7 has now followed us.

  • So thank you very much, buimik7, for the follow there.

  • DAVID MALAN: Nice, nice little animation there.

  • COLTON OGDEN: Sorry, what I was saying is I'll cut us in here.

  • So I have your screen, if you want to maybe start us off,

  • and if people have questions, they can provide it to us in the chat.

  • DAVID MALAN: Sure, so it sounds like you had a good teacher

  • all those years ago who introduced you to regular expressions.

  • COLTON OGDEN: He was pretty good.

  • He was pretty good.

  • DAVID MALAN: What are regular expressions, then?

  • COLTON OGDEN: I would describe regular expressions

  • as a way of matching patterns in text.

  • So being able to specify characters that can either

  • be specific or generic for a class of characters,

  • defining what's called a grammar, and although I

  • don't know the super deep details on formal grammar definitions and whatnot,

  • but I know that it is a grammar.

  • Computer languages, parsers, typically use

  • what are called grammars to verify that you're

  • using the correct semantic details that define what C is, what Python

  • is, et cetera, and extract symbols out of your text.

  • And you can do the same thing with regular expressions.

  • That's, I guess, how I think of regular expressions.

  • DAVID MALAN: Yeah absolutely.

  • And a lot of the validation that you might be doing in any web programming

  • that you've done, or if you've taken CS50

  • when we do a bit of this in Python and JavaScript,

  • you might just be in the habit of checking things for equality

  • or maybe emptiness.

  • So if the user did not type something in,

  • their input might be empty, or null, or the empty string,

  • depending on the language.

  • Otherwise, there's a value there which might pass validation,

  • but there's so many opportunities to actually check

  • did the user give you what you wanted them to give you.

  • For instance, did they type an actual email address that has an at sign

  • and a username and a domain name.

  • Did they type in a phone number in a specific format that you care about?

  • And so many other ways where you care not

  • just about the presence of a string, but what it is formatted as.

  • And even more powerfully, suppose that someone does type in, for instance,

  • a phone number, but some people here in the US use parentheses,

  • some people use hyphens, some people might use a plus and an area code.

  • There's so many different ways you might type in a number.

  • You can actually clean those up pretty readily using regular expressions.
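
As a rough illustration of the cleanup David describes, here is a minimal sketch that reduces differently formatted US phone numbers to bare digits. The function name `normalize_phone` and the sample numbers are hypothetical and do not appear in the stream.

```python
import re


def normalize_phone(raw):
    """Return only the digits of a phone number string."""
    # \D matches any character that is not a digit, so parentheses,
    # hyphens, spaces, and a leading plus sign are all removed.
    return re.sub(r"\D", "", raw)


# Made-up sample inputs in three common formats.
for raw in ["(617) 555-0100", "617-555-0100", "+1 617 555 0100"]:
    print(normalize_phone(raw))
```

The first two inputs reduce to the same ten digits; the third keeps its country code, so a real program would still need to decide how to treat that prefix.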

  • COLTON OGDEN: Indeed.

  • And I know an example that's typically,

  • I think, seen would be an email address,

  • and something that's a simple identifier of what an email address is

  • is usually that at symbol, right?

  • And normally a naive approach would be inspecting every character in a string,

  • if you're in C or in Python, and say if character is equal to at,

  • well, then I can kind of guess that maybe this is an email address,

  • but what if that's, for example, the first character?

  • Then clearly that would be an invalid email address.

  • Or the last character, that would be an invalid email address.

  • So that naive approach can be very sloppy and prone to a lot of, I think,

  • errors.

  • DAVID MALAN: Yeah absolutely.

  • And now, Minter27, I see your green screen is a bit messed up.

  • I think this is actually by design.

  • You're hopefully seeing a white border around the bottom and the side,

  • and then a big black box.

  • That's actually my black terminal window.

  • So if that's what you're seeing, that's intentional.

  • We're going to start typing in the terminal window soon.

  • COLTON OGDEN: And then we got MGuudow says hi from Denmark.

  • DAVID MALAN: Hello from Cambridge, Massachusetts.

  • All right, so shall we dive in?

  • I thought we would start with Python, which

  • is a language familiar to some folks, especially if you've

  • been tuning in to the stream lately.

  • I'm going to go ahead and just open up, for instance, a hello.py program.

  • I happen to be using VIM, which is a command line text

  • editor, similar in spirit to the more graphical Atom, or VSCode,

  • or other such tools these days.

  • And this will just allow me to stay within my terminal window environment,

  • but you can follow along with any text editor, if you'd like.

  • And let's just do something simple in Python.

  • So for instance, if I want to get a user's name,

  • I might say something like name = input, and then

  • just prompt them for their name using Python's built in input mechanism.

  • For those of you who've been following along with CS50,

  • you might know this function as getString,

  • which actually does a bit more error checking,

  • but the idea is exactly the same.

  • And then let's just keep it simple and say something like print, for instance,

  • hello, and then print out the person's name.

  • So no regular expressions yet, no conditions, no fanciness.

  • Let's just make sure we're getting into the momentum of actually writing

  • a program in Python.
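
The program described so far might look roughly like the sketch below. The prompt text `"Name: "` and the exact greeting format are assumptions, since the transcript does not show what was typed on screen, and the logic is wrapped in functions so it can be run without waiting for keyboard input.

```python
def greeting(name):
    """Build the greeting for a given name."""
    return f"hello, {name}"


def main():
    # Interactive version: prompt the user, then greet them.
    name = input("Name: ")
    print(greeting(name))


# main() is left uncalled here so the example runs non-interactively;
# call main() to be prompted for a name instead.
print(greeting("Colton"))
```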

  • COLTON OGDEN: And shoutouts to fdc227.

  • Thank you for the follow, as well.

  • DAVID MALAN: Well, nice to see you as well.

  • So let me go ahead and just run this.

  • Python of hello.py.

  • I'm technically going to use Python 3, the latest version.

  • So let me be ever so specific there.

  • It's asking me for my name, so I'll go ahead and type in Colton.

  • And there we have it.

  • Hello, Colton.

  • Now if I run this program again, which I can

  • do by hitting up in my terminal window's history, which you might know

  • from a Linux or Mac computer or Windows 10, I can play this game again

  • and type in my name David.

  • And then, here, for instance, we might type in Brenda,

  • our friend in New Zealand and now we have a program that's very dynamic.

  • But suppose that we're not such fans of Colton,

  • and we don't want him to be able to participate in this,

  • and we don't want to say hello, Colton if Colton is tuning in.

  • So how might we do this?

  • Well, let's go ahead and go back into the program.

  • Again, that program is called VI or Vim, and here's the program at hand.

  • And let me start to add some conditional logic.

  • So for instance, I might say something like, well,

  • if name equals equals 'Colton', well why don't we kind of mess around here

  • and say something like good bye instead of hello.

  • Otherwise, we can go ahead and print out the name as we intend.

  • So still no regular expressions, just using string comparisons now

  • with Python's equality operator, equals equals.

  • And now let me go ahead and run Python 3 of hello.py,

  • and now I'll go ahead and run that.

  • David will go ahead and play along.

  • Very nice and polite.

  • Brenda, very nice and polite.

  • Now we go ahead and type in Colton, and ooh, goodbye, Colton.

  • So not all that polite anymore, but we've just

  • checked for the presence of Colton.

  • So this is all fine and good, but suppose I do this.

  • Huh, I did that quickly, but it actually seemed to work this time.

  • I went ahead and just typed in Colton.

  • Now you can perhaps see it a bit more.

  • Notice here that I'm still greeting you, even though you are Colton.

  • COLTON OGDEN: That very first one did you put a space as well?

  • DAVID MALAN: I did.

  • I secretly put it at the end of the string.

  • COLTON OGDEN: Ah, OK.

  • I missed that part, OK.

  • DAVID MALAN: Indeed, so I did that really fast,

  • but you'll notice that unless Colton's name is exactly C-o-l-t-o-n,

  • it's not actually going to match.
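
A minimal sketch of the equality check and the way a trailing space slips past it. The function name `respond` and the exact message wording are assumptions.

```python
def respond(name):
    """Say goodbye to Colton and hello to everyone else."""
    # == compares the whole string, character for character.
    if name == "Colton":
        return f"goodbye, {name}"
    return f"hello, {name}"


# repr() makes the trailing space in the second result visible.
print(repr(respond("Colton")))
print(repr(respond("Colton ")))
```

The second call is greeted because `"Colton "` (with the space) is a different string from `"Colton"`.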

  • So how can we tolerate this?

  • Well, Python actually allows us some ways to do this.

  • If I go into Python, into hello.py again,

  • I could be a little more dynamic, and I could say something

  • like if Colton in name, which will search

  • for a substring of the original string-- look for Colton

  • as a substring of the variable name.

  • Now let me go ahead and run this.

  • So let me go ahead and run Python 3 of hello.py, and I'll go ahead now

  • and type in Colton.

  • Still works.

  • Space space space space Colton, still works.

  • More subtle, Colton space still works.

  • But better yet, Colton Ogden also still works.

  • COLTON OGDEN: OK, so you catch all instances of me.

  • DAVID MALAN: I can catch all instances of Colton.

  • Of course, it does--

  • I don't know, Coltonscopy is Colton's favorite username here,

  • but it will also catch that.

  • COLTON OGDEN: Spelling it wrong, by the way.

  • DAVID MALAN: Coltonscopy.

  • OK, well let's be precise with your name, Coltonoscopy.

  • That too is going to get caught.

  • Why?

  • COLTON OGDEN: Because it has my name in it.

  • It's just a substring of the total name that you have there.

  • So just those first six characters.
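
The substring test discussed here can be sketched as follows. The helper name `is_colton` is hypothetical.

```python
def is_colton(name):
    """Return True if "Colton" appears anywhere inside name."""
    # The in operator on strings performs a substring search.
    return "Colton" in name


samples = ["Colton", "    Colton", "Colton ", "Colton Ogden", "Coltonoscopy", "David"]
for name in samples:
    print(repr(name), is_colton(name))
```

Every sample except `"David"` matches, including `"Coltonoscopy"`, which is the over-matching problem raised in the conversation.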

  • DAVID MALAN: Yeah, so it's getting a little constrained

  • as to how we might want to express-- it's getting a little constrained

  • as to how it's behaving, and we might want

  • to start expressing ourselves a bit more explicitly.

  • Or what if we want to do something else altogether?

  • What if I go ahead and type in my full name, David Malan,

  • and I just want it to print out, hello David?

  • COLTON OGDEN: Right, just the first name.

  • DAVID MALAN: Yeah, so now things are getting a little more interesting,

  • and here's an opportunity just to use regular expressions.

  • Now we don't have to, and let's make sure we make clear the different ways

  • in which you can solve problems.

  • So if a human types in David space Malan, and all you want to say to them

  • is hello, David, how do you think about solving this problem?

  • COLTON OGDEN: Well, I think a somewhat naive approach

  • would be look for the first space.

  • DAVID MALAN: OK, so we could look for the first space.

  • So let's try this.

  • Let me go into hello.py again, and let's go ahead and get rid of this

  • and just start the story from when we've gotten the user's name.

  • So if we want to split on the space, how could we do this?

  • Like in C, as you alluded to earlier, oh my god, you could do this so tediously.

  • Iterate over every character in the string,

  • and then actually look for the space and print it out.

  • So let's do that for c in name, if c equals equals a space,

  • then we can go ahead and break perhaps.

  • Else, we can go ahead and print out, for instance, the character c.

  • Now this is going to be a little broken for the moment,

  • but let's see what happens first.

  • Let me go ahead and save this.

  • And I'm going to open a second tab, so that we

  • don't have to keep quitting and opening the program again.

  • So let's go ahead and run hello.py.

  • Let's go ahead and type in Colton Ogden's name.

  • And OK, so we're kind of one step closer to doing this.

  • COLTON OGDEN: You can see that at least the iteration's working.

  • DAVID MALAN: Yeah, it seems to be working, and so that's a nice progress.

  • And what's your middle name?

  • COLTON OGDEN: Taylor, T-A-Y-L-O-R.

  • DAVID MALAN: So here, too, should still work if all we care about

  • is your first name.

  • Now we're opening a can of worms if we want to get your middle name, too.

  • We'll have to come back to that, but let's go ahead now

  • and focus on cleaning this up a little bit.

  • It's printing one character per line.

  • Do you want to propose for folks why that is happening?

  • COLTON OGDEN: Because all we're doing is for every single c in name, which

  • is going to be every character in the name,

  • it's going to make two checks-- well, it's going to make one check.

  • Two checks, possibly one check, but it's going

  • to say if the character's equal to space, break, else print whatever

  • the correct character is.

  • And print in Python, by default, will print out a new line character,

  • unless you specify a separator.

  • DAVID MALAN: Exactly, and so this is--

  • COLTON OGDEN: Or end of line character.

  • DAVID MALAN: Exactly, and this is a little ugly looking in Python,

  • but you very verbosely have to say the end of my string should not be

  • the default, which is \n, but rather it's just, for instance,

  • the empty string, thereby overwriting it.

  • Looks atrocious, but unfortunately this is the way it is.

  • And it's kind of goes both ways.

  • In C, by contrast, you don't get the new line for free.

  • You actually have to put \n almost everywhere, unless you don't want it.

  • So Python optimized for, presumably, the common case.
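
To make the effect of `print`'s `end` argument concrete, here is a small sketch that writes each character of a string to an in-memory buffer, once with the default newline and once with an empty string. The helper `render` is hypothetical and exists only so the output can be inspected.

```python
import io


def render(chars, end):
    """Print each character with the given end string and return the text."""
    out = io.StringIO()
    for c in chars:
        # end replaces the "\n" that print appends by default.
        print(c, end=end, file=out)
    return out.getvalue()


print(repr(render("Colton", "\n")))  # one character per line
print(repr(render("Colton", "")))    # all characters on one line
```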

  • COLTON OGDEN: We have a few comments and I

  • thought that maybe we'd go through some of those and get us up to speed here.

  • Last I remembered seeing was the hi from Denmark,

  • so unfamiliar4 says bang uptime.

  • I'm not sure-- do you know that reference?

  • DAVID MALAN: Let me see.

  • How do we spell that?

  • COLTON OGDEN: Bang uptime.

  • DAVID MALAN: Well, let's try it.

  • So bang up time.

  • I'm not sure if that's what you meant, but--

  • COLTON OGDEN: Not sure what that is, but thank you.

  • I'm hoping maybe that just means, hey, we're up.

  • We're streaming, it's the time.

  • DAVID MALAN: We are up, yes.

  • COLTON OGDEN: Minter27 just said that that

  • was what he was referring to was the black background.

  • He thought that was the green screen.

  • DAVID MALAN: Oh no.

  • That's our black screen.

  • COLTON OGDEN: Hybridpenguin.

  • Hello David and Colton.

  • I'm a CS graduate from Lunds University in Sweden and love your stream.

  • It's really fun watching these streams on YouTube before work,

  • but today I could finally join the stream.

  • DAVID MALAN: Very nice.

  • Glad to see you live as well.

  • COLTON OGDEN: So thank you very much, Hybridpenguin.

  • Asley says, did you know that Python is named after Monty Python?

  • DAVID MALAN: I did actually.

  • COLTON OGDEN: And that's an homage to yesterday's stream with Veronica.

  • That was one of the things that she brought up first for us.

  • DAVID MALAN: Yes, I did.

  • I saw part of that yesterday, too.

  • And if you haven't, you can go on CS50's Twitch channel,

  • look at the last few streams, in fact, as well as on YouTube.com/CS50.

  • COLTON OGDEN: Yeah, yesterday's kind of ties into today because Veronica

  • talked about a lot of Python stuff.

  • So super, super, awesome stream.

  • Blowintothecartridge says hi, David and Colton.

  • Watching you guys from Switzerland.

  • DAVID MALAN: Hello from Cambridge.

  • COLTON OGDEN: Keep these streams up.

  • Very, very educational.

  • Thanks for all the great content.

  • Thank you blowintothecartridge.

  • We have seen you before on stream.

  • fdc227, hello from the University of Bristol.

  • Thanks for making knowledge available around the world.

  • We have a lot of coverage today.

  • DAVID MALAN: Indeed all over the world.

  • Welcome from Harvard University.

  • COLTON OGDEN: Cloudxyzc says hi.

  • Hello, Cloud.

  • Forsunlight, who is Fatima on Facebook, Hello Colton and David,

  • regulars, and everybody.

  • Thank you, Fatma, for joining us.

  • And Cloud remarks seems Europe is well represented here.

  • DAVID MALAN: Indeed.

  • COLTON OGDEN: Lots of hellos.

  • Lots of hellos.

  • And then Bhavik Knight says name.split.

  • DAVID MALAN: Oh, spoiler!

  • But good segue, too.

  • COLTON OGDEN: Yeah, we can take a look at that maybe next.

  • I'm assuming that's probably where we're going.

  • DAVID MALAN: Yeah, indeed.

  • So let me turn my attention back to the code where we just

  • left off by adding Colton's fix, where we changed the end of line

  • to quote unquote.

  • Let me go ahead and rerun this now with just Colton

  • Ogden, since his middle name doesn't really add much to the demonstration.

  • And, oh, so close.

  • Now it just looks stupid.

  • COLTON OGDEN: Now we need that new line character.

  • DAVID MALAN: We're going to need it somewhere.

  • So you know what?

  • Let me just go ahead and put it at the very end of the program.

  • We get one for free.

  • We can just call print like this, and now we

  • don't have to worry about being inside the loop anymore.

  • All right so let's go ahead and run this instead.

  • Python of hello.py, Colton, voila.

  • Now we've printed this all out.
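
Putting the pieces together, the character-by-character version might look like the sketch below. The function name `print_first_name` and its optional `file` parameter are assumptions added so the result can be captured; the stream's version simply prints to the terminal.

```python
import io


def print_first_name(name, file=None):
    """Print the characters of name up to, but not including, the first space."""
    for c in name:
        if c == " ":
            break
        # Suppress the default newline so the letters stay on one line.
        print(c, end="", file=file)
    # A bare print() supplies the single newline at the very end.
    print(file=file)


print_first_name("Colton Ogden")

# Capture the output in memory to check exactly what was written.
buffer = io.StringIO()
print_first_name("Colton Taylor Ogden", file=buffer)
assert buffer.getvalue() == "Colton\n"
```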

  • COLTON OGDEN: It looks great.

  • DAVID MALAN: So we're on our way here, right?

  • We've done a nice heuristic by just looking for the space,

  • but honestly this is pretty tedious, and it

  • feels very C like to iterate over the entire string, looking

  • for some special character.

  • It's not wrong.

  • It's perhaps not well designed because we could abstract this away,

  • and as Bhavik Knight proposes, we can actually use built in functionality.

  • So let me go ahead and do that instead.

  • Let me go ahead and say something like this.

  • Components gets name.split, and split on something like, quote unquote,

  • with a space in the middle.

  • So split, if you're unfamiliar, let's go ahead

  • pull up the Python documentation here.

  • Python 3 str split.

  • str of course implying, a string in Python, the data type.

  • Let's go--

  • COLTON OGDEN: We actually made a reference

  • to how you're not a huge fan of the Python documentation, yesterday.

  • DAVID MALAN: No, I already have misgivings about pulling this up.

  • COLTON OGDEN: And Veronica was saying how much she

  • is a fan of the Python documentation.

  • DAVID MALAN: No, we hereby retract all of yesterday's claims to the contrary.

  • Python's documentation is not very good, if only because it's very arcane.

  • It's incomplete.

  • It leaves too much to the reader's imagination.

  • So here we have str.split.

  • Notice that it takes in two arguments, both named arguments potentially,

  • the separator, as implied by sep which I specified as quote unquote

  • with a space in the middle, and then maxsplit,

  • which tells you the maximum number of splits you want done, in case

  • you care.

  • Negative one is the default, meaning no limit whatsoever.

  • So let's just take a look here.

  • These little screenshots in the documentation,

  • if I zoom in here in green, are using Python's interactive interpreter.

  • So some human in making the documentation

  • typed this into their Mac or PC and pretty much just copied and pasted

  • the output and put a green box around it.

  • So for instance, if you had a string that had 1,2,3

  • and you call split on it, passing in comma as the split separator,

  • well, you're going to get back this, a data structure in Python of type list,

  • which is like a dynamic array, which has 1 and 2 and 3, which are not numbers.

  • They themselves are strings or substrings, specifically.

  • Here too notice we can max out the number of return values

  • that we actually get in that list.

  • Here we're getting 1 and then 2,3 all as one

  • substring because maxsplit was specified as only split it once for us.

  • And then here, we're getting back everything,

  • including an empty string because we now have two commas in a row.

  • So just one tool in your toolkit if you've never used split before.

  • It's just a useful way of literally splitting a string.
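
The three documentation examples David reads through correspond to these calls, shown here as a short runnable sketch:

```python
# Split on every comma: three substrings, each still a str, not an int.
print("1,2,3".split(","))

# maxsplit=1 performs at most one split, so the remainder stays together.
print("1,2,3".split(",", maxsplit=1))

# Consecutive and trailing commas produce empty strings in the result.
print("1,2,,3,".split(","))
```

Note that `maxsplit` limits the number of splits, so the returned list has at most `maxsplit + 1` elements.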

  • COLTON OGDEN: And I think you can actually not specify the space

  • and it will still default to spaces, right?

  • DAVID MALAN: Well, let's take a look.

  • When in doubt consult the documentation, except when

  • the documentation doesn't say.

  • So using sep as the delimiter string, that's it. OK.

  • So if sep is given, consecutive delimiters are not grouped together,

  • dot, dot, dot, duh, duh, duh.

  • Splitting an empty string with specified separator.

  • COLTON OGDEN: Oh here it is.

  • It's right here.

  • DAVID MALAN: There we go.

  • So if sep is not specified, as you propose,

  • or is None, which is the default per the signature up there,

  • a different splitting algorithm is applied.

  • Runs of consecutive whitespace are regarded as a single separator,

  • and the result will contain no empty strings at the start or the end

  • if the string has leading or trailing whitespace.

  • Consequently, splitting an empty string or a string

  • consisting of just whitespace with a None separator returns the empty list.

  • COLTON OGDEN: It's kind of like a combination of strip and split.

  • DAVID MALAN: It is.

  • So it normalizes the space for you.

  • So if you've got multiple spaces in between Colton and Ogden,

  • you're going to split on that.

  • So actually, this is going to be a nice setup for what are regular expressions

  • because we can split on exactly that.
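
The difference between splitting with no separator and splitting on a single space can be seen in a quick sketch. The sample string, with its extra spaces, is made up for illustration.

```python
name = " Colton  Ogden "  # leading, doubled, and trailing spaces

# No separator: runs of whitespace collapse and the ends are trimmed.
print(name.split())

# Explicit single-space separator: every space is a boundary,
# so empty strings appear wherever spaces are adjacent or at the ends.
print(name.split(" "))

# An empty or all-whitespace string yields an empty list with no separator.
print("".split())
print("   ".split())
```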

  • COLTON OGDEN: Right, nice.

  • DAVID MALAN: All right.

  • So let's go back to the code here and see what we get back.

  • I called the return value components, and let's

  • just go ahead and, for the moment, print out components

  • to see what's going on inside there.

  • Now I'm going to go over here.

  • I'm going to go ahead and run Python 3 of hello.py.

  • Let's go ahead and type in your full name

  • this time so we get back as many components as possible, and you'll see,

  • just like the documentation, we got back 1 2 and 3, or Colton Taylor Ogden.

  • So how do we go about getting just the first name?

  • If we're assuming a name-like structure with first name last name,

  • how do we get the first, would you say?

  • COLTON OGDEN: Well, the lists in Python you can index into them

  • just like in arrays.

  • So you could say index in, by default, in Python.

  • Unlike Lua, Python is indexed at 0, which is what most programming

  • languages are indexed at.

  • So you could just say, if you're getting the first element, just components

  • index 0, components square brackets 0.

  • DAVID MALAN: OK, good.

  • So we can do exactly that.

  • So instead of printing components, let's print out components 0,

  • go back to my code here, rerun hello.py, and run Colton Taylor Ogden, and voila,

  • we're back where we started.
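
The split-and-index approach can be sketched as below. The function name `first_name` is hypothetical; the stream stores the list in a variable called components and prints element 0.

```python
def first_name(full_name):
    """Return everything before the first space in full_name."""
    # Split on a single space, as in the stream, then take element 0.
    components = full_name.split(" ")
    return components[0]


print(f"hello, {first_name('Colton Taylor Ogden')}")
```

Because splitting on an explicit `" "` always returns at least one element, indexing with `[0]` is safe even for an empty string here; with a bare `split()` an empty input would give an empty list and raise `IndexError`.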

  • COLTON OGDEN: Easy.

  • DAVID MALAN: All right, so easy peasy.

  • Has nothing to do with regular expressions yet,

  • and that's because we've deliberately confined ourselves

  • to pretty simple inputs.

  • So let's make it a little more complicated, and instead of

  • actually getting, say, just an individual name out of it,

  • let's suppose that your input isn't your name, but your email address.

  • Now it's getting a little more interesting,

  • and we're not going to care so much who you are,

  • but that you've given us a valid email address.

  • So let's undo all this, and let's change our variable to be called email.

  • Let's change our prompt to say email.

  • And now let's just say something like, if email,

  • let's just do something like 'Thanks for the email'.

  • Else.

  • Let's just go ahead and say, where is your email?

  • So this is about as simple as validation can get,

  • and you might recall doing this in the context of web programming.

  • Just checking if a string is actually present or not.

  • So let's now go to the program here, run hello.py, and type in, oh forget it.

  • I'm not going to type anything.

  • Huh, well where is your email?

  • So that's a nice little sanity check.

  • All right, let me go ahead and type in my email address.

  • Thanks for the email, but let's just say ooh, and just type in anything.

  • COLTON OGDEN: Oh, thanks for the email.
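
A sketch of this presence-only validation follows. The function name `respond` is an assumption; the two messages are taken from the conversation.

```python
def respond(email):
    """Accept any non-empty input as an email address."""
    # An empty string is falsy in Python; any other string is truthy.
    if email:
        return "Thanks for the email"
    return "Where is your email?"


print(respond(""))
print(respond("malan@harvard.edu"))
print(respond("ooh"))  # accepted, even though it is not an email address
```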

  • DAVID MALAN: Unfortunately, now we have a validation problem.

  • It'd be really nice to ensure that, no, you've got to cooperate and give us

  • a valid email address.

  • So how do we do this with split or equals equals?

  • COLTON OGDEN: Well, the first thing that we can do most-- well,

  • I would say all emails do need to have some sort of symbol to specify the--

  • I don't know what the technical name for it

  • is, but the name and then the domain and subdomain.

  • So we could just check to see whether there's

  • an at symbol in the string as the very first step.

  • DAVID MALAN: OK, so let's do that.

  • So let's check for an at symbol, and we did something like this

  • before when we checked for a Colton.

  • Let's just check for an at symbol.

  • So if at sign in email, let's go ahead and say 'Thanks for the email', else,

  • we can go ahead here and say print 'Where is your email?'

  • again.

  • All right, so you can probably see where this is going.

  • It's not going to be a perfect program, but here we go.

  • All right, Malan@harvard.edu.

  • Nice, cogden@CS50.harvard.edu, nice.

  • Colton is a, ah, hmm, interesting.

  • So here in the US, if you just put random punctuation in a sentence,

  • it often means an expletive.

  • Unfortunately--

  • COLTON OGDEN: A lot of unflattering Colton related topics today.

  • DAVID MALAN: Yeah, well I'm just venting, really, today

  • here on the internet.

  • But so it of course, had an at sign in it,

  • which was at the start of a subject--

  • I mean, which just had an at sign somewhere,

  • and unfortunately that's the only question we're asking.

  • So we have to be more precise.

  • It can't just contain an at sign.
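
The at-sign check and its weakness might be sketched like this. The third input is a hypothetical stand-in for what David typed, since the transcript does not record the exact characters.

```python
def respond(email):
    """Accept any input that contains an at sign somewhere."""
    if "@" in email:
        return "Thanks for the email"
    return "Where is your email?"


print(respond("malan@harvard.edu"))
print(respond("cogden@cs50.harvard.edu"))
# Hypothetical non-address that still contains an at sign.
print(respond("Colton is a @#$%!"))
print(respond("ooh"))
```

Only the last input is rejected; the third one passes because containing `@` is the only question being asked.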

  • COLTON OGDEN: So then we need to say, basically,

  • make sure the at sign, first of all, can't be the start or the end,

  • because the at is the specifier for, you have some name, some user,

  • and then they belong to some domain dot whatever.

  • So it can't be at the beginning or the end,

  • and thanks DragonQuestSlime for following.

  • We have some chat to catch up on, but we can maybe

  • do that after the next example.

  • DAVID MALAN: Now Bhavik Knight proposes splitting on the at sign here.

  • COLTON OGDEN: We could do that.

  • DAVID MALAN: Unfortunately, that's going to still

  • be vulnerable to a different sort of threat.

  • If we have multiple at signs in the email address,

  • even though we don't want those, we're going to get multiple parts.

  • And we could check for that, to be fair, but it's not

  • going to be quite as clean.

  • It'd be a lot nicer if I can say, this is the format that I expect.

  • Does the user's input actually match this, so to speak.
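
For comparison, here is one way the split-based check could be written, including the extra conditions it needs. This is a sketch of the approach Bhavik Knight proposes, not code from the stream, and the function name `looks_like_email` is hypothetical.

```python
def looks_like_email(email):
    """Check for exactly one at sign with text on both sides."""
    parts = email.split("@")
    # More or fewer than two parts means zero or multiple at signs.
    if len(parts) != 2:
        return False
    username, domain = parts
    # An at sign at the very start or end leaves one side empty.
    return username != "" and domain != ""


for candidate in ["malan@harvard.edu", "a@b@c", "@harvard.edu", "malan@", "ooh"]:
    print(repr(candidate), looks_like_email(candidate))
```

It works, but each new rule adds another condition, which is the accumulation of if statements the conversation is trying to avoid.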

  • COLTON OGDEN: The more of these sort of if statements, I think,

  • that we can avoid is ultimately the goal.

  • DAVID MALAN: Absolutely.

  • So let's start to do this a little more sophisticatedly, if you will,

  • and instead of just checking sort of loosely for the presence of an at sign,

  • let's see if someone's email input is user name at domain.

  • And we'll define it only at that super high level for now.

  • So how might I go about doing this?

  • Well, it turns out we can use the regular expression library.

  • So a regular expression is a string that is,

  • as you said as we began today, a pattern of symbols, numbers, letters,

  • punctuation, and included in many languages is support

  • for matching regular expressions and checking

  • whether the user's input matches, indeed, some pattern you intend.

  • So RE stands for regular expression.

  • You might verbally abbreviate regular expression as RegEx,

  • for regular expression.

  • And so if I import this library, I'm going

  • to have access to a whole bunch of Python

  • functionality related to regular expressions.

  • COLTON OGDEN: Do you think there's a "reg-ex" versus "rej-ex"

  • war like there is with gif and jif?

  • DAVID MALAN: Probably.

  • RegEx, I say RegEx.

  • What do you say?

  • COLTON OGDEN: I say RegEx as well because you taught me RegEx.

  • DAVID MALAN: Well, you learned well.

  • "Reg-ex," I mean, that's fair because it's regular expression,

  • and yet here I am saying "rej-ex."

  • I just feel like it flows more--

  • COLTON OGDEN: Well, we say char as well, but it should probably be "care."

  • DAVID MALAN: Yo, that's horrible.

  • No one should ever say care for character.

  • COLTON OGDEN: Oh, man.

  • DAVID MALAN: Anyhow, English is messy.

  • As are all our languages.

  • Oh, wait a minute, MKloppenburg, a little spoiler here.

  • Yes, there are even more sophisticated ways of doing this.

  • But we'll get there.

  • We'll get there real soon.

  • COLTON OGDEN: Andre had a funny thing he said earlier

  • when we were talking about the Python documentation.

  • DAVID MALAN: OK.

  • COLTON OGDEN: Oh man, how far up was it?

  • DAVID MALAN: Forsunlight was it?

  • COLTON OGDEN: No, Andre said- oh where was it?

  • Oh yeah, Forsunlight, which is Fatma, said,

  • does CS50 have any plan to improve the Python documentation?

  • Then Andre says, with a flamethrower.

  • DAVID MALAN: [LAUGHS] That would not be inappropriate.

  • I say this mostly with some historical context.

  • For many years, CS50 actually introduced students to PHP

  • at the end of the semester, instead of Python, the upside of which

  • was PHP is even closer to C's syntax.

  • It's pretty much C syntax with dollar signs in front

  • of variables and a few other changes.

  • But its documentation is outstanding, honestly, especially for newbies

  • to programming.

  • It always has nice examples.

  • It's standardized in how it presents its arguments to functions,

  • return values to functions.

  • There is often threaded discussion that's

  • filtered out so that you have really good questions

  • about the function or the library.

  • So we gave that up when we switched to Python, which really assumes,

  • I think, a more comfortable demographic, and also

  • an audience that is OK with incompleteness.

  • So unfortunately, we are now among them.

  • COLTON OGDEN: I think the thread idea would

  • be great for the Python documentation because that's actually really smart.

  • DAVID MALAN: The thread, what do you mean.

  • COLTON OGDEN: Having threaded discussion that gets filtered.

  • DAVID MALAN: Oh yeah.

  • COLTON OGDEN: Like Reddit, for example.

  • I think that's a great idea.

  • DAVID MALAN: Indeed.

  • COLTON OGDEN: Did we miss any other questions?

  • I'm going to make sure we didn't.

  • We have people who suggest-- so Cloudxyzc

  • has been suggesting a bunch of different things, checking for the at symbol.

  • He's been following along, and "David's real feelings come out,"

  • he says, in reference to the Colton curse joke.

  • DAVID MALAN: Yeah, indeed.

  • COLTON OGDEN: And let's not get started with the gif pronunciation.

  • DAVID MALAN: And you mean the jif?

  • COLTON OGDEN: I say jif.

  • I think because--

  • DAVID MALAN: I grew up saying GIF, but then I decided to go to the source.

  • And if you actually look at the author who created GIFs,

  • he has asked that we call it jif.

  • I think he's the only one with a say in this situation.

  • COLTON OGDEN: Yeah, yeah, yeah, and then Nikolai

  • says Python doc's littered with inconsistencies.

  • DAVID MALAN: Yes no.

  • So anyhow, so without getting too far off track.

  • We've only got two lines of code here.

  • We've got to take this home.

  • So we've just gotten user's input, stored it in a variable called email,

  • and let's now ask a more precise question, whether it looks

  • like an email.

  • And there are some more spoilers in the chat window here.

  • And we'll come back to those.

  • Those are indeed good next steps, but let's just start to ask this question.

  • So how might we do this?

  • Well, it turns out you could say something like this.

  • If re dot, hmm, how do we do this?

  • Well, I'm going to call it search, and then I'm

  • going to go ahead and specify a regular expression.

  • Now what's a regular expression?

  • It's going to be something at something else, ultimately.

  • And then I'm going to go ahead, and say print out 'Thanks for the email'.

  • And if it doesn't match that, I'm going to go ahead and say the familiar

  • before, 'Where is your email?'.

  • COLTON OGDEN: And to be clear, so this string that you're putting

  • in re.search, the function--

  • [PHONE RINGING]

  • Oh, I apologize for that.

  • The function that's in re-- or the string that's the argument

  • to re.search--

  • DAVID MALAN: Are we going to call it re and not r-e now?

  • COLTON OGDEN: r-e I'm sorry.

  • It's just a little easier for me.

  • DAVID MALAN: I'm going to say r-e, but that's fine.

  • COLTON OGDEN: So r-e dot search.

  • That string is basically sort of an abstract representation

  • of what you're looking for.

  • DAVID MALAN: Exactly, indeed.

  • So something is not what we're actually looking for, but let's get there.

  • Let's just take baby steps.

  • Let me save this.

  • Let me go ahead and rerun the program.

  • And let me type in my email address.

  • It's malan@harvard.edu-- not going to validate,

  • because that is literally not something at something.

  • And, in fact, I screwed up entirely: missing 1

  • required positional argument: 'string'.

  • COLTON OGDEN: OK.

  • DAVID MALAN: So it turns out I'm typing too quickly.

  • I actually have to search a specific string.

  • What I actually want to search

  • needs to be the user's actual input.

  • COLTON OGDEN: Right.

  • DAVID MALAN: So let me actually search the email variable

  • for something at something.

  • COLTON OGDEN: UsmanJafri, thank you very much for joining us.

  • Oh, and Eleevas at the same time, thank you for joining us both.

  • DAVID MALAN: Welcome aboard.

  • So now let's go ahead and run this again.

  • Let me go ahead and type in my actual email address.

  • And of course it doesn't validate because it is not

  • something at something.

  • Let's go ahead and do that, something at something, Enter.

  • OK.

  • So baby step, it's not the end goal, but at least we're one step closer.

  • We now need to generalize what the first something is and the second something.
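
The baby step above can be sketched as follows. The function name respond is a hypothetical wrapper added so the logic can be called repeatedly; in the stream the value comes from user input stored in a variable called email.

```python
import re


def respond(email):
    # re.search(pattern, string): the pattern comes first, then the text to search.
    # Leaving out the second argument raises a TypeError
    # (missing 1 required positional argument: 'string').
    if re.search("something@something", email):
        return "Thanks for the email"
    return "Where is your email?"


print(respond("malan@harvard.edu"))
print(respond("something@something"))
```

Because the pattern contains no special characters yet, it behaves like a plain substring check: only input that literally contains something@something is accepted.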

  • COLTON OGDEN: Yeah, because right now it looks like it's just literally looking

  • for something at something.

  • DAVID MALAN: Yeah.

  • So it turns out, let's start small here and just search for harvard.edu

  • specifically.

  • COLTON OGDEN: OK.

  • DAVID MALAN: So now this still isn't quite correct, but now

  • if I go back to my code and do something at something,

  • that's not going to work anymore because it has to be Harvard.edu.

  • And indeed, something at harvard.edu could now actually work.

  • COLTON OGDEN: OK.

  • DAVID MALAN: Now it turns out that if we go ahead

  • and do malan@harvard.edu that too is not going to work.

  • But if I go in here and stop expecting literally something,

  • and just look for at harvard.edu-- let's go ahead and save this,

  • go back to my program, and now go ahead and run malan@harvard.edu.

  • Now we're getting somewhere.

  • And I dare say, if we go and search for say, dmalan@harvard.edu--

  • slightly different user name--

  • that seems to be working.

  • But, but, but if we search for cogden@cs50.harvard.edu, what

  • do you think?

  • COLTON OGDEN: It's not going to work because there's

  • the CS50 subdomain in front of it.

  • DAVID MALAN: Exactly.

  • So where is your email?

  • It's not recognizing you, even though clearly that's

  • a good looking email address.

  • COLTON OGDEN: Can we bring up the source code one more time?

  • I think I missed the exact step.

  • DAVID MALAN: Sure.

  • COLTON OGDEN: re.search@harvard-- oh, OK.

  • Gotcha.

  • Because it's looking basically just for that substring.

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: Make it true, OK.
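
A small sketch of why the subdomain address is rejected at this point, assuming the pattern on screen is the substring @harvard.edu (the dot is still unescaped here, as in the stream). It also shows the trade-off of dropping the at sign; the address someone@notharvard.edu is a made-up example.

```python
import re

with_at = "@harvard.edu"
without_at = "harvard.edu"

for address in ["malan@harvard.edu", "dmalan@harvard.edu", "cogden@cs50.harvard.edu"]:
    # The text right after the at sign must be "harvard", so "cs50." breaks the match.
    print(address, bool(re.search(with_at, address)))

# Dropping the at sign accepts the subdomain address, but also unrelated domains.
print(bool(re.search(without_at, "cogden@cs50.harvard.edu")))
print(bool(re.search(without_at, "someone@notharvard.edu")))
```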

  • DAVID MALAN: Now we could get rid of the at sign

  • and say, OK, well this will now detect Colton's email address too,

  • but now he could be at not Harvard.edu and that would still match.

  • And so it's getting a little tricky to express precisely, yet generously,

  • exactly what kind of string we're looking for.

  • So it turns out that re.search takes as its first argument, not just

  • a string, which is what I've been using, but a more general regular expression.

  • And a regular expression is, again, a pattern.

  • And in that pattern you can use special place-holders for strings

  • or for substrings and characters.

  • So in particular, if I want to say something, so to speak,

  • conceptually, I can actually say, put any character there,

  • and then expect the at sign.

  • And if I want to expect two characters, I

  • can do this-- three characters, four characters, five characters,

  • and so forth.

  • Or if I'm not sure how many characters, I can say zero or more characters.

  • COLTON OGDEN: OK.

  • DAVID MALAN: Or that's a little weird for an email address,

  • because I do want a user name there.

  • So I can actually say, one or more characters.

  • COLTON OGDEN: So the star is, it could be 0.

  • It could be nothing.

  • DAVID MALAN: 0 or more.

  • COLTON OGDEN: Plus means any positive number of characters.

  • DAVID MALAN: One or more, exactly.

  • COLTON OGDEN: And then the dot is just a wild card.

  • DAVID MALAN: A wild card that, for the most part,

  • signifies any possible character-- small white lie.

  • We have implications with whitespace and other special characters.

  • But for the most part, it means any letter, number, or punctuation symbol.

  • COLTON OGDEN: OK.

  • It looks a lot more flexible now.
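
To make the three symbols concrete, here is a short sketch comparing star and plus after a dot. The variable names are illustrative only.

```python
import re

# "." matches any single character (except a newline, by default).
# "*" repeats the preceding item zero or more times.
# "+" repeats the preceding item one or more times.
zero_or_more = ".*@harvard.edu"
one_or_more = ".+@harvard.edu"

for address in ["malan@harvard.edu", "dmalan@harvard.edu", "@harvard.edu"]:
    print(address,
          bool(re.search(zero_or_more, address)),
          bool(re.search(one_or_more, address)))
```

Only the plus version rejects an address with nothing before the at sign, which is why it suits a user name.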

  • DAVID MALAN: All right.

  • So it's still not going to handle you just yet, but it is going to handle me,

  • it would seem.

  • So let me go ahead and save this.

  • Let me go back to my program, clear the screen and start fresh.

  • Let's go ahead and search for malan@harvard.edu, still working.

  • dmalan@harvard.edu, still working.

  • @harvard.edu, not working.

  • COLTON OGDEN: Right, because you've specified

  • it has to be at least one character--

  • DAVID MALAN: At least one character--

  • COLTON OGDEN: Before.

  • And is that technically true?

  • Do emails have to have-- can emails be one character long

  • as the first character of the subject?

  • I've seen, like, g.harvard.edu--

  • DAVID MALAN: One is fine.

  • COLTON OGDEN: --as a subdomain.

  • DAVID MALAN: I am pretty sure you need at least one though.

  • The email spec is actually super complicated.

  • And someone proposed earlier that we use a library.

  • That is going to be the best solution in the end,

  • because the format of an email address, even though most of us

  • have pretty normal looking email addresses,

  • there can be some funkiness in there.

  • But I'm pretty sure you need at least one character.

  • So the plus is appropriate.

  • But there's a way around this.

  • Suppose that you forgot about the plus operator--

  • you could say, well, give me one character,

  • and then give me zero or more of another.

  • But plus exists just to express that same idea, dot dot

  • star, a little more succinctly.

  • COLTON OGDEN: Gotcha.

  • It makes sense.
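
A quick sketch checking that the two spellings agree on a few sample inputs (the sample addresses are illustrative):

```python
import re

long_form = "..*@harvard.edu"   # one character, then zero or more characters
short_form = ".+@harvard.edu"   # one or more characters

samples = ["malan@harvard.edu", "m@harvard.edu", "@harvard.edu", "harvard.edu"]
for s in samples:
    # Both patterns should give the same answer for every sample.
    print(s, bool(re.search(long_form, s)), bool(re.search(short_form, s)))
```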

  • DAVID MALAN: All right.

  • So unfortunately-- let me go back to the previous version using the actual plus.

  • It'd be nice if we could actually support my email address

  • and your email address.

  • COLTON OGDEN: Right.

  • DAVID MALAN: So how do we go about expressing that?

  • We kind of want to support something dot harvard.edu,

  • but also no such something.

  • COLTON OGDEN: So then maybe the zero or more thing we looked at earlier.

  • DAVID MALAN: Yeah.

  • So we need a way to kind of express conditionally, maybe it's

  • there, maybe it's not.

  • But if it is there, there's only one of those things.

  • So let me go ahead and say, well maybe we'll support CS50 email addresses dot,

  • but you know what, let's kind of make this optional.

  • And so I can use some special syntax.

  • I can use a parenthesis around-- whoops.

  • I can use a parenthesis to the left and to the right

  • of what I want to make optional.

  • COLTON OGDEN: OK.

  • DAVID MALAN: And then how do I say optional here?

  • I don't want zero or more, because I don't want it to be CS50, CS50, CS50, CS50

  • dot harvard.edu.

  • COLTON OGDEN: So I'm thinking like, something or something else.

  • DAVID MALAN: Or could potentially work.

  • So you could actually express or, and you could literally say,

  • a vertical bar, which means, or this.

  • And of course there's nothing there, because my thought

  • has ended with the parentheses.

  • That looks a little weird.

  • So I probably wouldn't typically do that.

  • How else might we do this?

  • COLTON OGDEN: Besides the or, I'm not--

  • I'm not sure, because the or's my first thought.

  • DAVID MALAN: It's the right instinct.

  • But there's just often multiple ways to express this.

  • And so we've looked at star, which is zero or more, plus, which is one or more.

  • There also is question mark, which is 0 or 1.

  • COLTON OGDEN: Oh, OK.

  • 0.

  • OK.

  • So it's limited between just 0 or 1.

  • DAVID MALAN: 0 or 1.

  • So it's there or it's not.

  • COLTON OGDEN: We can't do CS50 dot CS50 in this case.

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: OK.

  • That makes sense.
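
A sketch of the optional group, assuming the pattern at this stage is .+@(cs50.)?harvard.edu with the dots still unescaped. The empty alternative and the star variant are included only for comparison.

```python
import re

optional = ".+@(cs50.)?harvard.edu"    # group appears zero times or once
empty_or = ".+@(cs50.|)harvard.edu"    # same idea, written with "or nothing"
repeated = ".+@(cs50.)*harvard.edu"    # star would allow cs50.cs50.cs50...

for address in ["malan@harvard.edu", "cogden@cs50.harvard.edu",
                "cogden@cs50.cs50.harvard.edu"]:
    print(address,
          bool(re.search(optional, address)),
          bool(re.search(empty_or, address)),
          bool(re.search(repeated, address)))
```

The question mark and the empty alternative accept the same addresses here; the question mark is just the more conventional way to write it.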

  • DAVID MALAN: All right.

  • So let me go ahead now and save this.

  • We're saving hello.py.

  • Let me go back here, clear the screen, just so we can start fresh,

  • and go ahead and type in malan@harvard.edu.

  • Thanks for the email.

  • And here's the test of dmalan@harvard.edu.

  • Looking good.

  • @harvard.edu, not looking good.

  • That's expected.

  • COLTON OGDEN: Right.

  • DAVID MALAN: And here we go, cogden@cs50.harvard.edu?

  • Thanks for the email.

  • COLTON OGDEN: Nice.

  • It worked.

  • DAVID MALAN: So now we're detecting both of our strings here.

  • COLTON OGDEN: Awesome.

  • It's become a lot more robust.

  • DAVID MALAN: Yeah.

  • Now of course CS50 is offered at Harvard and Yale.

  • So some of our staff have cs50.yale.edu email addresses.

  • How can we go about expressing this then?

  • COLTON OGDEN: Maybe for that we could use the or possibly, CS50 or Yale,

  • to limit the two, right?

  • And we could maybe do it within the parentheses where the CS50 dot is?

  • DAVID MALAN: Sure.

  • So we could say, cs50.harvard cs50.yale.

  • COLTON OGDEN: And then you can get rid of-- oh.

  • DAVID MALAN: We'd have to get rid of this, out here.

  • COLTON OGDEN: OK.

  • DAVID MALAN: But I think we can shrink this a little bit.

  • There's a little redundancy here.

  • COLTON OGDEN: You can get rid of the CS50 or take it outside of the--

  • DAVID MALAN: Yeah.

  • COLTON OGDEN: --harvard or yale.

  • DAVID MALAN: So let's unwind here.

  • So this is where we started.

  • COLTON OGDEN: Right.

  • DAVID MALAN: If I know that the domain name now is going to be-- whoops--

  • is going to be harvard or yale, I can literally express exactly that--

  • COLTON OGDEN: Right, OK.

  • DAVID MALAN: --and just say this.

  • And the vertical bar, much like a bitwise or in C or other languages,

  • just means harvard or yale.

  • And notice, you might be inclined to be nice and stylistically pretty

  • and do something like this, like you might in actual programming languages.

  • This is not good though, here, because you are now literally saying,

  • give me a space, then harvard then a space,

  • or give me a space then yale then another space.

  • COLTON OGDEN: Right.

  • DAVID MALAN: So don't try to over engineer your style.

  • Just say exactly and only what you mean.

  • COLTON OGDEN: It's very much whitespace sensitive.
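
A sketch of the whitespace pitfall just described, assuming the pattern at this stage is .+@(cs50.)?(harvard|yale).edu. The spaced variant is the "stylistically pretty" version that should be avoided.

```python
import re

tight = ".+@(cs50.)?(harvard|yale).edu"
spaced = ".+@(cs50.)?( harvard | yale ).edu"   # the spaces are part of the pattern

print(bool(re.search(tight, "malan@harvard.edu")))
print(bool(re.search(spaced, "malan@harvard.edu")))
# The spaced pattern only matches input that literally contains those spaces.
print(bool(re.search(spaced, "malan@ harvard .edu")))
```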

  • DAVID MALAN: Indeed.

  • Now some folks might be inclined here to put a question mark here.

  • Do I want to do that though?

  • COLTON OGDEN: Well no because you do want at least some domain, right?

  • DAVID MALAN: Exactly.

  • We want some domain there.

  • So harvard or yale.

  • So we want one and only one, which is just implied by just typing it out.

  • COLTON OGDEN: Right.

  • DAVID MALAN: All right.

  • So let's try this.

  • So let's go ahead and save this.

  • Let's go over here.

  • Let's go ahead and run on malan@harvard.edu.

  • Let's go ahead and run it on cogden@cs50.harvard.edu.

  • Let's go ahead and run it on malan@cs50.yale.edu.

  • COLTON OGDEN: Wow.

  • DAVID MALAN: We're looking pretty good.

  • It's pretty versatile.

  • COLTON OGDEN: Great.

  • We're limited to the world of Harvard and Yale, but still--

  • DAVID MALAN: At the moment.

  • Yes, indeed, at the moment.

  • If you have a long list of schools, it's going to get messy quickly.

  • But notice this general principle.

  • Like, honestly, if you were to look at this string,

  • especially being new to regular expressions,

  • I have no idea what this means.

  • But notice how we built it up incrementally.

  • And even to this day, 20-some odd years after learning regular expressions,

  • I do this too.

  • I start matching on the simplest thing possible, test it.

  • Add a little piece, test it.

  • Add another little piece, test it, so that you actually

  • understand everything that's going on.

  • COLTON OGDEN: Forget a little syntax, Google it.

  • DAVID MALAN: Yes.

  • COLTON OGDEN: Forget syntax, Google it.

  • DAVID MALAN: Sure.

  • Indeed.

  • Because it looks very cryptic otherwise at first glance.

  • COLTON OGDEN: I was thinking maybe we can take a few questions.

  • DAVID MALAN: Sure.

  • Let's take a look.

  • COLTON OGDEN: We can take a few comments.

  • I'm going to scroll up here and see where we stopped.

  • Oh, Nikolai was referring to the PHP code.

  • It's inconsistent documentation.

  • It's been forever since I looked at the PHP documentation.

  • Jraphics interchange format, says MKloppenburg, in reference to JIF.

  • DAVID MALAN: Oh, yep.

  • OK.

  • Indeed.

  • COLTON OGDEN: Oh, Nikolai with some inappropriate-- so from yesterday

  • as well.

  • So if we could keep it PG, keep it kid-friendly,

  • that would be much appreciated.

  • Do not want to ban anybody from the chat.

  • But we cannot have any of that sort of thing continuing to go on.

  • Appreciate the enthusiasm though.

  • Nick Napoli-- is that the running zombie from the game, Dead Ahead?

  • I'm actually not sure.

  • It's a very ubiquitously seen Twitch widget, I think.

  • I'll have to Google that actually, referring to the follow zombie I think.

  • A lot of comments about the profanity.

  • TwitchHelloWorld wrote, you mean plus, not star, right?

  • I think he's referring to--

  • well we covered both of them.

  • So I think as soon as he wrote that, you probably

  • covered the both of them, the star and the plus?

  • DAVID MALAN: Indeed.

  • Yep, yep.

  • COLTON OGDEN: OK, just making sure.

  • DAVID MALAN: Any TwitchHelloWorld?

  • COLTON OGDEN: TwitchHelloWorld.

  • "So here prof typed plus, though I thought Colton said star."

  • I'm not sure.

  • Did I say the wrong one?

  • DAVID MALAN: I don't recall, to be honest.

  • But let's scroll down further.

  • I think TwitchHelloWorld has to--

  • COLTON OGDEN: Oh, right.

  • It was just explaining afterwards and he got mixed up.

  • DAVID MALAN: Yeah.

  • OK.

  • Down here.

  • COLTON OGDEN: Oh, gotcha.

  • OK.

  • Are plus, star, and question mark defined in the function

  • re.search or in the general Python documentation?

  • Thanks.

  • DAVID MALAN: Technically the general documentation,

  • because there are other functions that support regular expressions in Python.

  • There's another function called match, instead of

  • search, which is almost the same, except it only

  • matches at the beginning of a string.

  • Frankly it's not all that much more compelling,

  • though there might be an optimization gain there.

  • But the syntax is actually derived from earlier languages.

  • Python has simply incorporated them into its own syntax.

  • So the argument to re.search and the argument

  • to re.match and other functions too potentially

  • use that standard regular expression syntax.
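
A minimal sketch of the difference between the two functions, using a plain word as the pattern so that only the starting position matters:

```python
import re

text = "malan@harvard.edu"

# re.search scans the whole string for the first place the pattern fits.
print(re.search("harvard", text))

# re.match only tries at the very beginning, so this returns None.
print(re.match("harvard", text))

# Both succeed when the pattern fits at position 0.
print(re.match("malan", text))
```

Both functions take the same pattern syntax, so star, plus, and question mark behave identically in either one.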

  • COLTON OGDEN: And I believe you can type an r in front of the string.

  • And in most text editors it will syntax-highlight the regular expression, right?

  • DAVID MALAN: Not my version of Vim here.

  • COLTON OGDEN: Oh, OK.

  • DAVID MALAN: Yeah.

  • And that actually stands for a raw string,

  • which just tends to be used for regular expressions

  • so that you don't have to escape certain characters.

  • COLTON OGDEN: Gotcha.

  • Gotcha.
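
A short sketch of what the r prefix changes. It does not alter regular expression syntax; it only stops Python from interpreting backslashes inside the string literal before the pattern reaches the re module.

```python
import re

print(len("\n"))    # 1: a single newline character
print(len(r"\n"))   # 2: a backslash followed by the letter n

# Without the r prefix, a backslash meant for the regex has to be doubled.
print(r"\." == "\\.")

# Either spelling gives the regex engine a backslash plus a dot: a literal dot.
print(bool(re.search(r"\.", "harvard.edu")))
print(bool(re.search(r"\.", "harvardxedu")))
```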

  • Blah, blah, blah.

  • Oh, it says MKloppenburg tossed the R-E. I'll say,

  • R-E not RE, the R-E documentation link in chat.

  • How does it differentiate between a wild card and a literal dot?

  • DAVID MALAN: Woo, really good question.

  • At the moment, it doesn't.

  • And in fact I've been a little sloppy here because it turns out--

  • let's see if I can do this.

  • Let's go ahead and run this program with malan@harvardxedu.

  • COLTON OGDEN: Ugh.

  • DAVID MALAN: Ooh.

  • COLTON OGDEN: Because the dot is just saying,

  • any character you want here, whether it's a period

  • or whether it's something else.

  • DAVID MALAN: Indeed.

  • So even though I've specified a dot and it

  • looks perfectly sensible, harvard.edu, yale.edu,

  • dot indeed means any character.

  • So you'd have to escape it.

  • And this is true in a lot of languages, Python among them.

  • Anytime you want to say, no, I mean a literal dot,

  • often the answer is just to escape it, the convention for which

  • is a single backslash.

  • So this now means, not a special placeholder for

  • any character, but literally a dot.

  • COLTON OGDEN: Much like you would see for most--

  • what's it called?

  • Escape characters in C.

  • DAVID MALAN: Yep, exactly.

  • Backslash N, backslash T, backslash A, any number of other ones as well.

  • So let's go ahead and rerun this now, malan@harvardxedu.

  • That does not work now.

  • But if I go ahead and type in malan@harvard.edu,

  • now in fact it works.

  • COLTON OGDEN: Nice.

  • DAVID MALAN: So I should, for good measure,

  • go back in and even fix this here, because I do want,

  • literally, CS50 dot, if it's present, but I don't want to put it here--

  • COLTON OGDEN: Right.

  • DAVID MALAN: --because then your user name would have to be dot, literally.

  • COLTON OGDEN: Or more than one dot, right, because you have the plus?

  • DAVID MALAN: One or more dot, indeed.

  • Exactly.
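
A sketch of the fix, assuming the pattern before and after looks like the two strings below. The input malan@harvardxedu, with an x where the dot should be, is chosen here for illustration; raw strings are used so the backslashes reach the regex engine unchanged.

```python
import re

loose = ".+@(cs50.)?(harvard|yale).edu"       # every dot means "any character"
strict = r".+@(cs50\.)?(harvard|yale)\.edu"   # escaped dots must be real dots

for address in ["malan@harvardxedu", "malan@harvard.edu", "cogden@cs50.harvard.edu"]:
    print(address,
          bool(re.search(loose, address)),
          bool(re.search(strict, address)))
```

The dot in .+ is deliberately left unescaped: it still needs to mean any character, since escaping it would require the user name to consist of literal dots.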

  • COLTON OGDEN: I'll make sure we didn't miss anyone.

  • EMAAAAAN says hello.

  • I'm Emaan.

  • So thank you very much for joining us.

  • I've finished CS50 but can't verify my account

  • to get a certificate because I can't have an ID.

  • What should I do?

  • DAVID MALAN: As always, Minter27, email certificates@cs50.harvard.edu,

  • which wonderfully is a valid email address for today.

  • COLTON OGDEN: Nice.

  • Unfamiliar-- the art of regular expressions-- capturing

  • what you want and only what you want.

  • DAVID MALAN: That's very beautiful, Unfamiliar.

  • Thank you.

  • COLTON OGDEN: Internet down, lunch break, says PresidentOfMars.

  • Sorry to hear about that.

  • Fatma says, thank you, Colton.

  • There are a lot of content appropriate-- we don't need that here.

  • Asley, Colton is nice, even when he is scolding people.

  • I thought my-- blah, blah, blah.

  • OK.

  • Thank you very much, everybody, for the kind words.

  • So where to go from here?

  • DAVID MALAN: All right.

  • So it turns out, it's still buggy, and no one

  • seems to have pointed this out yet.

  • Let me go ahead and claim that, you know what, my email address is

  • malan@harvard.education, which frankly, these days might actually

  • be a valid TLD.

  • COLTON OGDEN: No, that's true.

  • Yeah, yeah, yeah.

  • DAVID MALAN: Educational-- that's not.

  • COLTON OGDEN: TLD, top level domain, meaning

  • the things you put at the end of-- in a website or an email address.

  • DAVID MALAN: Exactly.

  • So let's go ahead and type this in, and dang it, that is not harvard.edu.

  • So where's the bug?

  • COLTON OGDEN: Well you're just searching for it in the string, right?

  • And like you said, match will identify--

  • does match-- OK, so you said match starts at the beginning of the string.

  • Does that mean that it will keep searching on after that?

  • So it'd basically search with just, starting at the beginning?

  • DAVID MALAN: Correct.

  • Only at the beginning.

  • COLTON OGDEN: And not just matching the contents of the string itself?

  • DAVID MALAN: Correct.

  • COLTON OGDEN: OK.

  • So in that case, search and match would have the same bug.

  • You're just doing a search.

  • You're just iterating through it, and so as it finds the edu part,

  • it doesn't care whether there's nothing after it--

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: Or whether there's another [INAUDIBLE] or something after it.

  • DAVID MALAN: Indeed.

  • So we need to kind of specify that we want

  • to search to the end of the string, and the very last character has

  • to be edu, end of thought.

  • And so it's kind of not obvious how to express this, right?

  • Because you want to say, no more characters, but how

  • do you type no more characters?

  • Well, the authors of regular expressions years

  • ago had to just decide on an arbitrary symbol that denoted end of string.

  • And so the character they chose is, weirdly enough, the dollar sign.

  • COLTON OGDEN: Interesting.

  • DAVID MALAN: So that's not a literal dollar sign.

  • That means edu has to be the last three characters of the string,

  • otherwise we're not going to get a match.

  • COLTON OGDEN: If we wanted a literal dollar sign,

  • would it be backslash dollar sign?

  • DAVID MALAN: Indeed.

  • If you want edu, money, then, yes, escape the dollar sign.

  • But here we don't want a literal dollar sign.

  • COLTON OGDEN: OK.

  • DAVID MALAN: So now let's go back here, run it again,

  • malan@harvard.educational.

  • Now it's no longer a valid email address.

  • But malan@harvard.edu is actually a valid email address.
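
A sketch of the end-of-string anchor, assuming the pattern at this stage has escaped dots and differs only by the trailing dollar sign. The last two lines illustrate the escaped, literal dollar sign with a made-up input.

```python
import re

unanchored = r".+@(cs50\.)?(harvard|yale)\.edu"
anchored = r".+@(cs50\.)?(harvard|yale)\.edu$"   # "$" means end of string

for address in ["malan@harvard.edu", "malan@harvard.educational"]:
    print(address,
          bool(re.search(unanchored, address)),
          bool(re.search(anchored, address)))

# Escaped, the dollar sign is an ordinary character to look for.
print(bool(re.search(r"edu\$", "harvard.edu$")))
print(bool(re.search(r"edu\$", "harvard.edu")))
```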

  • COLTON OGDEN: OK.

  • DAVID MALAN: All right.

  • So it's not that material that we're matching at the beginning of the string

  • here, but maybe.

  • Let me go ahead and try this again.

  • So david j malan@harvard.edu.

  • That's not actually my email address, but it seems to match the pattern.

  • COLTON OGDEN: Yeah, and it wouldn't fly for a normal email

  • because you have spaces.

  • DAVID MALAN: Yeah, that's not going to work.

  • But I still think there's a bug.

  • So honestly, after all these minutes of, like,

  • building up this regular expression, we're still not done.

  • But we're almost there.

  • COLTON OGDEN: But this would be a million times worse

  • if this were a series of if statements.

  • DAVID MALAN: Yes.

  • Yeah, checking for all possible valid email addresses

  • is not going to be very fun either.

  • So what do we want to express here, perhaps?

  • COLTON OGDEN: No whitespaces allowed, essentially.

  • DAVID MALAN: Yeah.

  • So how would we express that?

  • So it turns out we can approach this in a few different ways.

  • Your first instinct might be to say, well, email addresses should only have,

  • let's say, alphabetical letters.

  • COLTON OGDEN: Sure.

  • DAVID MALAN: So how might we express that?

  • Well it turns out, you can have what are called character classes

  • in a regular expression, whereby you literally

  • type, square brackets, and then you type out all of the characters

  • you want to allow.

  • So for instance, ABCDEF, ABCDEFGHIJKLMNOPQRSTUVWXYZ.

  • I mean, this is not going to scale very well,

  • because I haven't even typed in the lowercase letters.

  • So thankfully, character classes also support ASCII or Unicode ranges.

  • A through Z--

  • COLTON OGDEN: OK.

  • DAVID MALAN: --is valid.

  • And it means the exact same thing as typing out all 26 letters.

  • And if we want to say, lowercase, we can say little a through little z.

  • That will also work.

  • And it does not matter that the big Z is next to the little a.

  • Each of these is being treated as a single character,

  • except for the hyphen, which means a range of characters.

  • COLTON OGDEN: And it's specifically within the context

  • of these square brackets.

  • Because if you did this outside of the square brackets--

  • DAVID MALAN: Yes.

  • COLTON OGDEN: --that would be looking for the literal string, A-Z.

  • DAVID MALAN: Literally.

  • Literally.

  • The square brackets are so important here.

  • And if I want to do numbers, that will work too.

  • At least for decimal I can say 0 through 9.

  • COLTON OGDEN: OK.

  • DAVID MALAN: So now I've got a lot of possible user names now.

  • I'm skipping some characters, but we'll come back to that in a moment.

  • Now that would seem to be a better way of expressing this and omitting spaces.

  • COLTON OGDEN: Right.
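
A sketch of character classes and ranges. It uses string.ascii_uppercase from the standard library to spell out all 26 letters, and re.fullmatch to test one character at a time; both are choices made for this illustration rather than anything shown in the stream.

```python
import re
import string

spelled_out = "[" + string.ascii_uppercase + "]"   # [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
ranged = "[A-Z]"                                   # same set, written as a range
user_chars = "[A-Za-z0-9]"                         # letters and decimal digits

# The spelled-out class and the range accept exactly the same characters.
print(all(bool(re.fullmatch(spelled_out, c)) == bool(re.fullmatch(ranged, c))
          for c in string.printable))

# Inside brackets the hyphen means a range; outside, "A-Z" is three literal characters.
print(bool(re.search("[A-Z]", "M")))
print(bool(re.search("A-Z", "M")))
print(bool(re.search("A-Z", "A-Z")))

# A space is not in the class, so it cannot be part of the match.
print(bool(re.fullmatch(user_chars + "+", "david j malan")))
```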

  • DAVID MALAN: So let's go ahead here, go back and try this again.

  • david j malan@harvard.edu.

  • And interesting, it's still actually matching.

  • COLTON OGDEN: Tuanvu9884, thank you very much for the follow.

  • Because it's doing a search, and it's finding--

  • it's still finding malan@harvard.edu.

  • DAVID MALAN: Ooh.

  • Yeah, you're good.

  • You're good.

  • COLTON OGDEN: I was taught-- yeah, I was taught by the best.

  • DAVID MALAN: Hey.

  • Yeah, wink, wink.

  • So we kind of want to express that there cannot be anything to the left of these

  • characters.

  • So just to be super clear, Colton has indeed identified the fact

  • that this character class, which is saying, give me one or more

  • of these preceding characters, A through Z, a through z,

  • or 0 through 9, that matches.

  • Because M-A-L-A-N matches exactly that.

  • There's no spaces in malan, but there is a space before it

  • and then there's the period and the j and the space and the david before it.

  • But the catch is that all of that stuff can happen kind

  • of before this character class here.

  • So right in this space, theoretically, is there room for david,

  • j, or any other number of strings that have spaces,

  • because this character class is only going to match what it can,

  • which is malan.
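
A sketch showing exactly which part of the input the search latches onto, assuming the pattern at this stage is the one below and the input typed was david j. malan@harvard.edu.

```python
import re

pattern = r"[A-Za-z0-9]+@(cs50\.)?(harvard|yale)\.edu$"

m = re.search(pattern, "david j. malan@harvard.edu")
# The match object records the substring that satisfied the pattern.
print(m.group(0))   # malan@harvard.edu
print(m.start())    # 9, the index where "malan" begins
```

Nothing in the pattern says the match has to begin at the start of the input, so everything before malan is simply skipped over.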

  • COLTON OGDEN: So in this case would we want to switch to the re.match

  • and start from the beginning?

  • DAVID MALAN: We could.

  • So let's try that.

  • So if we go to re.match, re being the library--

  • re.match-- I actually don't know what most people say.

  • re.match will start, by definition, from the start of the string.

  • So let's go ahead and save this.

  • Let's go back here, try one more time. david j malan with those two spaces,

  • and now it seems to be catching the mistake now.

  • Now let's go ahead and do malan@harvard.edu.

  • That's working as intended, too.

  • But we don't have to do this.

  • And honestly I find this annoying in Python,

  • that you have to vaguely remember whether it's match or search.

  • And honestly I get them backwards all the time.

  • Literally before we started, I Googled to make sure I got it right.

  • I just tend to use search.

  • You might pay, theoretically I suppose, if we read closely in the docs,

  • a slight performance penalty.

  • Because by saying re.match, you're giving the runtime an advantage

  • by just starting literally at the start of the string.

  • But frankly that tends to be an over-optimization.

  • So I would actually just use the opposite

  • of the dollar sign, which completely confusingly is the caret symbol, which

  • is over one of the numbers on your keyboards typically,

  • depending on the country you're from.

  • And that would mean, start from the start of the string.

  • Dollar sign means end of the string.

  • And now we have a complete thought from start to end.

  • COLTON OGDEN: I like that.

  • I like that better.

  • DAVID MALAN: I think it's just kind of cleaner,

  • even though you could certainly make an argument for using re.match instead.

  • All right.

  • So let's try this.

  • Save this, go back to our program.

  • Type in our david space j space malan@harvard.edu.

  • Nope, not allowed.

  • COLTON OGDEN: Right.

  • DAVID MALAN: malan@harvard.edu is indeed now allowed.

  • And just for good measure, make sure we didn't

  • have a regression. colton odgen--

  • COLTON OGDEN: Regression testing.

  • DAVID MALAN: --@cs50.harvard.edu is also now working.

  • COLTON OGDEN: Nice.

  • DAVID MALAN: So we're in pretty good shape now.
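
The three checks just described can be sketched as a tiny regression test. The pattern is again an assumption, since the transcript does not show the exact one on screen; the only point is the caret at the start and the dollar sign at the end. The address cogden@cs50.harvard.edu is the one mentioned later in the stream.

```python
import re

# Assumed pattern, anchored at both ends with ^ and $.
PATTERN = r"^[A-Za-z0-9]+@(cs50\.)?harvard\.edu$"


def is_valid(email: str) -> bool:
    """True if the whole string, start to end, fits the pattern."""
    return re.search(PATTERN, email) is not None


candidates = [
    "david j malan@harvard.edu",  # spaces before the user name
    "malan@harvard.edu",          # plain address
    "cogden@cs50.harvard.edu",    # address with the optional subdomain
]
for candidate in candidates:
    print(candidate, is_valid(candidate))
```

Because of the caret, re.search can no longer skip over the leading "david j ", so it behaves like re.match here.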

  • Why don't we pop off a few questions that I saw coming in related to emails?

  • COLTON OGDEN: Yeah.

  • Let me go up to where we stopped.

  • So Bhavik says-- oh yeah, raw string, which is what you said.

  • Very easy tutorial in RegEx, says Cloudxyzc.

  • So that's what-- you're saying that you're doing an awesome job.

  • DAVID MALAN: Oh, thank you very much.

  • COLTON OGDEN: Some of us are new to it, like me, says Asley.

  • Yeah, no, this is great.

  • This is awesome.

  • AkshayMandhan says hello.

  • Good to see you.

  • Thank you for joining us today.

  • Brenda's plucking off, as well, certificates@CS50.harvard.edu

  • for cert questions.

  • Thank you, Brenda.

  • Mitch27 says, thank you.

  • Crabs01, I need a course on statistics.

  • I think we're going to have a--

  • sorry, I'm blanking-- a stream on R with Andy Chan in a couple of weeks, which

  • will kind of go into biostatistics.

  • So tune in for that one.

  • Dollar sign, then shortcut to go to the last character of the line,

  • is that the case?

  • DAVID MALAN: Oh yeah.

  • This is now unrelated to regular expressions.

  • Notice my cursor is currently here on the screen.

  • If I hit the dollar sign, I can coincidentally go all the way

  • to the end of the line as well.

  • COLTON OGDEN: It almost seems like it is related.

  • It seems like that's probably deliberate.

  • DAVID MALAN: Oh, I suppose it is, actually.

  • COLTON OGDEN: Yeah.

  • DAVID MALAN: Yes.

  • Dollar sign is, indeed, deliberate.

  • And I can hit the caret symbol to go to the beginning.

  • COLTON OGDEN: Yeah.

  • Yeah, they have to design it that way.

  • That's cool.

  • Vim tutorial-- vim tutorial would be cool.

  • DAVID MALAN: There you go.

  • Well you're going to have to get someone better at Vim than me though.

  • COLTON OGDEN: Irene, would edu dollar accept malan@harvard.educational.edu,

  • or would it only match edu at the end?

  • DAVID MALAN: If you literally typed dot edu dollar sign,

  • it would only match dot edu at the end.

  • COLTON OGDEN: Right.

  • DAVID MALAN: If you wanted to support educational, we can go down this road.

  • Notice that we could do, edu(cational), put that in parentheses,

  • add a question mark and make it optional--

  • COLTON OGDEN: Nice.

  • DAVID MALAN: --so that it's there or not there.
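
A short sketch of the optional group, looking only at the end of the string. The two pattern names are hypothetical, and the sample addresses are made up for illustration.

```python
import re

STRICT = r"\.edu$"                # must end in a literal ".edu"
OPTIONAL = r"\.edu(cational)?$"   # may end in ".edu" or ".educational"


def ends_with(pattern: str, text: str) -> bool:
    """True if text ends the way the pattern requires."""
    return re.search(pattern, text) is not None


print(ends_with(STRICT, "malan@harvard.edu"))
print(ends_with(STRICT, "malan@harvard.educational"))
print(ends_with(OPTIONAL, "malan@harvard.educational"))
print(ends_with(OPTIONAL, "malan@harvard.edu.com"))
```

The question mark applies to the whole parenthesized group, so "cational" is either entirely present or entirely absent.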

  • COLTON OGDEN: OK.

  • And I think she was saying also, educational-edu,

  • like have edu kind of by itself after another string.

  • Like educational edu.

  • DAVID MALAN: No.

  • So if you have dot edu, dollar sign, you will literally

  • match only those four characters-- dot edu dollar sign.

  • Anything more expressive than that, you're

  • going to need to lengthen the string.

  • COLTON OGDEN: Right.

  • We can have gmail, yahoo, et cetera.

  • How to check that after the at symbol-- we have to have one dot character.

  • DAVID MALAN: Say that once again?

  • COLTON OGDEN: We can have Gmail or Yahoo, et cetera.

  • How do we check after the at symbol to have only one period character?

  • DAVID MALAN: Oh.

  • So right now, we would have to jettison our subdomain here.

  • So right now we are allowing for cs50.harvard.edu.

  • Sorry.

  • We're allowing for CS50 dot to either be there or not be there.

  • That's, of course, where potentially our second dot is coming from.

  • So we could certainly get rid of that second dot

  • by just no longer supporting subdomains within harvard.edu.

  • And what's nice about this now is that we

  • could support even other universities.

  • So for instance, we could add Stanford in there or MIT or any number of others

  • without worrying about the subdomain, so long as they all end in dot edu.

  • Or-- let me just free up some space--

  • I could even be a little crazier, and it's going to look a little ugly,

  • but I could do this, in parentheses, and then

  • I could say something like, or gmail.com and actually

  • support either harvard.edu or yale.edu or gmail.com, so long as you

  • build these up in a nested fashion.

  • So this is kind of like arithmetic with parentheses.

  • Growing up, if you did lots of additions and subtractions and multiplications,

  • divisions inside parentheses, order of operations matters most.

  • So when reading these things, you're going

  • to want to look for the most deeply nested parentheses.

  • For instance, harvard.edu.

  • Then work your way out from those, thereby looking at this.

  • Then you could notice, oh, here's a vertical bar.

  • So that means this thing to the left or this thing to the right

  • is what's going to have to match.
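
One way the nested version might look, as a sketch. The exact expression typed on stream is not in the transcript, so this pattern is an assumption that follows the description: harvard.edu or yale.edu, or else gmail.com.

```python
import re

# Inner group: harvard or yale, followed by ".edu".
# Outer group: that whole thing, OR "gmail.com".
PATTERN = r"^[A-Za-z0-9]+@((harvard|yale)\.edu|gmail\.com)$"


def is_valid(email: str) -> bool:
    return re.search(PATTERN, email) is not None


for candidate in ["malan@harvard.edu", "malan@yale.edu",
                  "malan@gmail.com", "malan@harvard.com"]:
    print(candidate, is_valid(candidate))
```

Reading from the innermost parentheses outward, (harvard|yale) is resolved first, then the vertical bar between the .edu branch and gmail.com.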

  • COLTON OGDEN: Nice.

  • And Bollco87, thank you so much for following us.

  • Let me make sure that we didn't miss any other questions.

  • Wait, is this the professor from CS50 at Harvard?

  • Says, HomeLine.

  • DAVID MALAN: I think so.

  • COLTON OGDEN: Yes.

  • This is David Malan, Jedi master, the master Yoda.

  • DAVID MALAN: Named by someone else.

  • COLTON OGDEN: And, yep.

  • And they're saying that a caret is the Vim shortcut to beginning of the line.

  • DAVID MALAN: So this is turning into a Vim chat.

  • COLTON OGDEN: A little bit, yeah.

  • That's kind of the direction a lot of streams go.

  • ShaneHughes1972, is it advantageous to wait for the next calendar

  • year to start the course?

  • DAVID MALAN: No.

  • If you have time now, start now.

  • There's always going to be something new on the horizon.

  • Companies release new hardware every year.

  • So I think the same logic you might apply to buying a laptop or a computer

  • or a game console or whatnot applies here.

  • Yes, you could wait for the next one, but you're then

  • missing out on the next few weeks, months, or whatever that duration is.

  • So if you want to start something, whether it's

  • CS50 or something hardware or some other course, start when you have the time.

  • COLTON OGDEN: That being said, we are currently

  • in the process of getting our January 1 release for CS50 on EdX 2019,

  • which uses the 2018 material up and online if folks

  • want to put that in their calendar.

  • But folks can go on YouTube just to see CS50 2018's lectures right now if you

  • want to get a head start on all the material, and then sort of do the work

  • and then submit your content, submit your work for the 2018 content

  • to start the calendar year.

  • DAVID MALAN: Absolutely.

  • You're certainly welcome to wait.

  • Brenda, yes we can see you.

  • So for all the extensions, we have to hard code the subdomain.

  • Short answer, yes.

  • You could generalize this with a function.

  • You could build up your string using even rows from a database.

  • But short answer, yes.

  • In the simplest form, you just OR them all together

  • using the vertical bar.
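
A sketch of generalizing this with a function. The list ALLOWED_DOMAINS and the function build_pattern are hypothetical names; in practice the list might come from a config file or from rows in a database, as suggested above.

```python
import re

# Hypothetical list of domains to accept.
ALLOWED_DOMAINS = ["cs50.harvard.edu", "harvard.edu", "yale.edu", "gmail.com"]


def build_pattern(domains):
    """OR the domains together with vertical bars into one compiled pattern."""
    # re.escape puts a backslash before each ".", so it matches a literal dot.
    alternatives = "|".join(re.escape(domain) for domain in domains)
    return re.compile(rf"^[A-Za-z0-9]+@({alternatives})$")


EMAIL = build_pattern(ALLOWED_DOMAINS)

print(EMAIL.pattern)
print(EMAIL.search("cogden@cs50.harvard.edu") is not None)
print(EMAIL.search("malan@harvardxedu") is not None)
```

Using re.escape avoids having to remember to escape each dot by hand when the list grows.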

  • Of course at some point it becomes less readable.

  • So frankly, from a design perspective of my code,

  • if I want to handle both cs50.harvard.edu and yale.edu,

  • I might leave this regular expression now alone.

  • And if I want to actually support another domain,

  • I might do something like this and say, you

  • know what, I'm also going to support gmail.com or, you know what,

  • let's go ahead and support gmail.com or outlook.com to support two dot coms.

  • And you could start to bucketize your conditions into, these are the edus,

  • these are the dot coms.

  • It's adding some redundancy and it's indeed

  • being a little more wasteful, because you

  • might be checking the strings more times than you need to,

  • but you're probably over optimizing.

  • If you care about that, you're running this code

  • in a loop that's just executing so many times that those milliseconds add up.

  • Frankly I would find something like this probably more maintainable,

  • more readable, even if you're paying a minor performance price.
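
A sketch of the bucketized version, with one expression for the edus and one for the dot coms. The patterns and the function name classify are assumptions for illustration; the code on screen is not in the transcript.

```python
import re

# One expression per bucket, checked one after the other.
EDU = r"^[A-Za-z0-9]+@((cs50\.)?harvard|yale)\.edu$"
COM = r"^[A-Za-z0-9]+@(gmail|outlook)\.com$"


def classify(email: str):
    """Return 'edu', 'com', or None if the address fits neither bucket."""
    if re.search(EDU, email):
        return "edu"
    elif re.search(COM, email):
        return "com"
    return None


for candidate in ["cogden@cs50.harvard.edu", "malan@yale.edu",
                  "malan@outlook.com", "malan@example.org"]:
    print(candidate, classify(candidate))
```

A dot com address gets checked against both patterns, which is the small redundancy mentioned above, but each expression stays short enough to read.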

  • COLTON OGDEN: This would be definitely, if you have some sort of semantic value

  • associated with different domains--

  • DAVID MALAN: Yeah.

  • COLTON OGDEN: --but a lot of websites probably just have generic--

  • you can have any valid email from any given website or domain--

  • DAVID MALAN: Indeed.

  • COLTON OGDEN: --and it will probably, I'm guessing,

  • just use the same pattern that we showed at the very beginning of the A to Z.

  • DAVID MALAN: Essentially.

  • You can actually be more fancy than that.

  • And in fact, let me undo this so that we can build this out.

  • Many of you online probably have email addresses that

  • have, for instance, dots in them, dashes in them, underscores in them.

  • It turns out character classes are wonderfully receptive to that.

  • You can literally just put in underscore.

  • You can put an escaped dot, and you can put an escaped dash.

  • The escaped dash is super important now because, in the context

  • of the square brackets, it represents a range character as well.

  • So now that's even more expressive than it was before.
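
A sketch of the widened character class. The rest of the pattern is an assumption (a plain harvard.edu domain), and the sample addresses are made up.

```python
import re

# Letters, digits, underscore, a literal dot, and a literal dash.
PATTERN = r"^[A-Za-z0-9_\.\-]+@harvard\.edu$"


def is_valid(email: str) -> bool:
    return re.search(PATTERN, email) is not None


print(is_valid("david.j_malan-1@harvard.edu"))
print(is_valid("david j malan@harvard.edu"))  # spaces are still not in the class
```

In Python, a dot inside square brackets is already literal, so escaping it there is optional. The dash is the one that matters, since an unescaped dash between two characters denotes a range.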

  • COLTON OGDEN: Vullem, thank you for joining us.

  • He says, I can listen to Mr. Malan for hours-- great teacher.

  • DAVID MALAN: Thank you.

  • COLTON OGDEN: If you want to do so, again, all of our lecture videos

  • are on CS50's YouTube channel, which you might

  • be watching it right now if you're watching this Twitch video on YouTube.

  • DAVID MALAN: Oh, I like what Bhavik Knight has proposed here.

  • Your regular expression's a little fancier.

  • You're supporting not only dot edu and dot com and dot org,

  • but this backslash W is kind of interesting.

  • COLTON OGDEN: Yeah, at the start of the string, which will allow us--

  • I'm assuming it would allow us to put some spaces beforehand,

  • before the email?

  • So that way if the user accidentally hits space or whatnot in the field,

  • it won't say that there's an error.

  • DAVID MALAN: Good hypothesis, but not quite, if I can clarify.

  • COLTON OGDEN: If I'm wrong, I apologize.

  • DAVID MALAN: That's OK.

  • I'm actually going to pull up the documentation here,

  • because I think it might help to see a more

  • thorough listing of the various symbols that are allowed.

  • So let me go ahead and search Python--

  • COLTON OGDEN: It's any non-white space character.

  • Is that correct?

  • DAVID MALAN: There you go.

  • It's literally the opposite of what you are saying.

  • COLTON OGDEN: I forgot about the slash.

  • By the way, missed, TulioNoguera, thank you very much for the follow.

  • And BJeff, I'm not sure if I caught that one.

  • Thank you very much for the follow as well.

  • DAVID MALAN: So here I am on Python's regular expression operations.

  • So you can see this on docs dot Python dot-- whoops.

  • Docs.Python.org/3/library/re.html.

  • So here you'll see Python documentation for regular expressions.

  • And it's way more verbose than we need to get into just now.

  • But let me start to scroll down to regular expression syntax.

  • You'll see some nice introductory explanations

  • of what these things are, though learning from the Python docs

  • is probably easier said than done.

  • But here's a list of the special characters.

  • So dot, we already discussed, meaning any character, except for a new line.

  • So I did say there's some corner cases with the whitespace,

  • and that's indeed one of them.

  • Caret, which matches the start of the string.

  • Dollar sign, which matches the end of the string.

  • Star and plus and question mark-- man, we've actually

  • bit off a lot of these for now.

  • COLTON OGDEN: Right.

  • Yeah.

  • DAVID MALAN: Let's keep scrolling further.

  • I'm going to wave my hands at some of these,

  • because there's some fanciness you can get into that, honestly in my life,

  • I've not had terribly many occasions to need greedy matches or--

  • sometimes greedy matches, but you can also do look ahead

  • and some other fancier features.

  • And I don't think we'll get too into depth on that.

  • But rest assured that if you ever encounter

  • a problem that you're struggling with regular expressions, odds are you

  • can solve it.

  • So dive back into the documentation to find some additional feature

  • that they might have.

  • We didn't look at some of these.

  • Curly braces actually have special significance.

  • So let's actually come back to the code we were writing earlier.

  • And suppose, like, someone was proposing we support edu, com, and org,

  • like Bhavik Knight was proposing.

  • Suppose we just kind of generalize that.

  • Well I could say, com or edu or org.

  • Or, you know what, those are three letters.

  • We could maybe, a little lazily, just say,

  • you know what, go ahead and just support three letters, dot, dot, dot.

  • Very lazy.

  • It's going to allow for weird domains that don't actually exist, but so

  • be it.

  • But you could also express that by doing this.

  • So now things are getting really cryptic.

  • But if we parse this, you see harvard or yale.

  • Then you have a literal dot, because it's escaped.

  • Then you have a wild card, any character,

  • and then three copies of any character.

  • Now this doesn't have to be the same character.

  • It's just any character, any character, any character.

  • COLTON OGDEN: And is this 3 referring to the dot

  • that you put before those brackets?

  • Or is it by default brackets?

  • DAVID MALAN: No.

  • It's literally referring to the dot beforehand.

  • So if you wanted to refer to the same letter again and again and again,

  • then it has to be here.

  • So if you wanted to have the letter A, it would be A, A, A.

  • Or dot would be something, something, something, but different somethings.

  • COLTON OGDEN: Makes sense.

  • DAVID MALAN: And, if for whatever reason,

  • you want to support, say, two characters or three,

  • which you might with domain names-- like country codes

  • are two letters or more traditional TLDs are 3, you could do 2 comma 3

  • and do a range.

  • So syntax is getting really crazy now.

  • But again, if you just focus on what the basic definition is,

  • it all works out pretty cleanly.
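
A sketch of the curly-brace syntax. The patterns are assumptions built from the description above (harvard or yale, a literal dot, then a counted wild card); the constant names are hypothetical.

```python
import re

THREE = r"^[A-Za-z0-9]+@(harvard|yale)\..{3}$"           # exactly three of anything
TWO_OR_THREE = r"^[A-Za-z0-9]+@(harvard|yale)\..{2,3}$"  # two or three of anything


def fits(pattern: str, email: str) -> bool:
    return re.search(pattern, email) is not None


print(fits(THREE, "malan@harvard.edu"))
print(fits(THREE, "malan@harvard.e!u"))   # lazy: any three characters pass
print(fits(THREE, "malan@harvard.uk"))
print(fits(TWO_OR_THREE, "malan@harvard.uk"))

# The count applies to whatever comes right before the braces.
print(re.fullmatch(r"a{3}", "aaa") is not None)  # the letter a, three times
print(re.fullmatch(r".{3}", "abc") is not None)  # three characters, not necessarily equal
```

As noted above, the wild card version accepts domains that do not exist; a stricter sketch might use [a-z]{2,3} instead of .{2,3}.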

  • COLTON OGDEN: Code Beastie and Stimpy, thank you very much for the follows.

  • DAVID MALAN: Oh, nice to see you as well.

  • And let's finish this thought with backslash W.

  • So let me go back to the documentation.

  • And now I'm just curious.

  • Let's just start going and going and going.

  • And let's see.

  • Here we go.

  • So in the discussion of character classes, as denoted by square brackets,

  • you see a whole bunch of bullets here explaining things.

  • I'm going to fast forward to this one.

  • Character classes, such as slash W or slash S, capital S,

  • are also accepted inside a set.

  • So let's find the documentation.

  • It says, defined below.

  • So let's keep going.

  • Excuse me.

  • Let's keep going.

  • There's a lot of features of regular expressions,

  • but you'll use these less frequently, some of them.

  • OK.

  • Here we go.

  • Now we're getting to the special characters.

  • And here, that's our slash W. So for Unicode or str patterns

  • it matches Unicode word characters.

  • This includes most characters that can be part of a word in any language,

  • as well as numbers and the underscore.

  • And you can see, if you actually use ASCII,

  • which is a subset of Unicode, just fewer characters that have been around

  • since the beginning of computers, notice that it's using almost the same pattern

  • that we were using, except that I added in dots and dashes

  • to support things like gmail.
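
A small sketch of backslash w with and without the ASCII flag. The names WORD and ASCII_WORD are hypothetical, and the sample strings are made up.

```python
import re

WORD = re.compile(r"^\w+$")                 # Unicode word characters (the default for str)
ASCII_WORD = re.compile(r"^\w+$", re.ASCII) # restricted to a-z, A-Z, 0-9, and underscore

accented = "mar\u00eda"  # "maria" with an accented i

print(WORD.search("malan_50") is not None)       # letters, digits, underscore
print(WORD.search("malan.50") is not None)       # the dot is not a word character
print(WORD.search(accented) is not None)         # accented letters count in Unicode mode
print(ASCII_WORD.search(accented) is not None)   # but not with re.ASCII
```

Since backslash w does not include dots or dashes, a user-name class that needs them would still have to list them, for instance [\w.\-]+.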

  • COLTON OGDEN: When I started the first time

  • I thought it was a W for whitespace, which is why--

  • DAVID MALAN: No.

  • Totally reasonable.

  • COLTON OGDEN: --I had that instinct.

  • DAVID MALAN: But we got something for you too.

  • COLTON OGDEN: Uh oh.

  • DAVID MALAN: Backslash s.

  • Lowercase s does indeed match any space, which includes a literal space

  • here, a tab character, a new line, carriage return, form feed,

  • and vertical tab as well.

  • COLTON OGDEN: I don't actually know what a form feed is.

  • What is a form feed?

  • DAVID MALAN: A form feed is sort of old school, where

  • it moves down to the next line, I think, from typewriter days, essentially.

  • COLTON OGDEN: OK.

  • It makes sense.

  • It makes sense.

  • DAVID MALAN: I think that's what it is.

  • It's been some time since I needed it to work.

  • But notice-- let me point out one other thing.

  • If we keep scrolling, you'll see kind of the opposite, backslash capital

  • S, which matches the opposite of backslash s.

  • So if you want, not whitespace, but anything other

  • than whitespace with some of these character symbols,

  • you can actually just capitalize it.

  • The same thing for backslash w.

  • If you don't want a word character, you want everything else,

  • all the funky characters on the keyboard,

  • then you can do backslash capital W, all within those character classes,

  • or even outside, if you just want to match one such thing.
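
A brief sketch of the lowercase and capitalized forms side by side. The helper has_no_whitespace is a hypothetical name, and the sample strings are made up.

```python
import re

# \s finds each whitespace character: here a space, a tab, and a new line.
print(re.findall(r"\s", "a b\tc\nd"))

# \W is everything that is not a word character, so this strips punctuation.
print(re.sub(r"\W", "", "c.ogden-50!"))


def has_no_whitespace(text: str) -> bool:
    """True if text is one or more non-whitespace characters and nothing else."""
    return re.search(r"^\S+$", text) is not None


print(has_no_whitespace("malan@harvard.edu"))
print(has_no_whitespace(" malan@harvard.edu"))  # leading space
```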

  • COLTON OGDEN: So a lot of learning regular expressions is

  • kind of dialing into this documentation and memorizing

  • what all these, sort of symbols mean.

  • But the logic for it is actually very simple.

  • DAVID MALAN: Yeah.

  • And in fact, unfortunately regular expression syntax

  • is kind of hard to google, because you're

  • typing in crazy sequences of symbols.

  • So just express it in English or any language

  • you speak to see if you can find your way to Stack Overflow or someone's

  • explanation.

  • COLTON OGDEN: Like regular expressions, avoid whitespace.

  • DAVID MALAN: Yeah, exactly.

  • That's a good one.

  • Should we take a look at the chat again?

  • COLTON OGDEN: Yeah.

  • TwitchHelloWorld says, do some people who have

  • or rent their own server also create email addresses using these?

  • Is there a point at which one might simply

  • code some wild cards like Colton is saying,

  • then have a program that simply tests that the email goes through

  • without receiving an error message?

  • DAVID MALAN: That's a really good question.

  • And I would make a distinction between syntactically valid and actually valid,

  • where actually valid in my mind would mean

  • it's a real email address that belongs to a real human, that's

  • ideally checking that email account.

  • We're just talking about syntax today.

  • Regular expressions cannot tell you if cogden@cs50.harvard.edu actually exists

  • or if malan@harvard.edu actually exists.

  • All it can tell you is that yes or no, this email address

  • is structured in a way consistent with the formal definition

  • of an email address.

  • COLTON OGDEN: Makes sense.

  • OK.

  • DAVID MALAN: So yes, you would need to use,

  • like a cloud-based service or your own server

  • to actually send a verification email to the human, like all of us

  • are in the habit of receiving when we sign up for new accounts on websites,

  • to actually see if the human responds and confirms the existence.

  • COLTON OGDEN: Got it.

  • Bhavik Knight says, do we need to escape in a group?

  • I think it doesn't need to escape in a group.

  • DAVID MALAN: In a group-- in a capture group,

  • yes, you would still need to escape, if that's what you mean.
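
A quick sketch of why the escape still matters inside parentheses. It also shows square brackets, in case the question was about a character set rather than a capture group; that reading of the question is an assumption.

```python
import re

# Inside a group, an unescaped dot is still a wild card.
print(re.search(r"^(harvard.edu)$", "harvardxedu") is not None)   # wild card matches "x"
print(re.search(r"^(harvard\.edu)$", "harvardxedu") is not None)  # literal dot does not
print(re.search(r"^(harvard\.edu)$", "harvard.edu") is not None)

# Inside square brackets, a dot is already literal without a backslash.
print(re.fullmatch(r"[.]", ".") is not None)
print(re.fullmatch(r"[.]", "x") is not None)
```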

  • COLTON OGDEN: How difficult would it be to create a RegEx parser from scratch?

  • DAVID MALAN: That's a good question.

  • How to create a regular expression parser from scratch?

  • It depends on how many features you want to support.

  • To be honest, most of the features that we have just discussed

  • can be implemented relatively simply.

  • And in fact, if we can get all academic on you--

  • can I pull the whiteboard out for a moment?

  • COLTON OGDEN: Yeah, absolutely.

  • DAVID MALAN: So if you've never--

  • COLTON OGDEN: I'll step out of your way.

  • DAVID MALAN: Sure.

  • So here we have an actual whiteboard, no technology here.

  • I've pulled it onto the screen and I've got my black marker here.

  • So it turns out that regular expressions map to, academically, something

  • called the class of regular languages that

  • can be expressed in special syntax.

  • That is regular expression syntax.

  • But it turns out they map directly to what

  • are called DFAs or deterministic finite automata,

  • which are very simple machines that you can implement on Mac or PC or even

  • on a whiteboard, that represent that particular language.

  • So for instance, the way you would typically

  • draw a DFA or deterministic finite automaton is with states.

  • So I might draw a circle here.

  • And hopefully everyone can see this from afar.

  • That circle-- I'm just going to put a little caret symbol there

  • to imply that this is the first state.

  • And if I want to ultimately draw a picture that

  • represents an email address, I'm going to essentially do something like this.

  • I'm going to think of the email address as having

  • three parts-- the beginning, the middle, and the end.

  • And the end will be my final state here, just denoted

  • with a slightly different symbol.

  • And what I want each of these states to represent is something.

  • So I might here think of this state as representing the at sign.

  • At that point I've read in an at sign.

  • So before that is the user name.

  • And after that is the end of my expression.

  • Now how am I actually going to do this?

  • So here I might draw a picture that says something like this.

  • In order to start from this state and move from this state,

  • I have to consume some number of letters.

  • And let's keep it simple and let's just say

  • that it's alphabetical letters for now.

  • So I have to consume, either an a through a z, for simplicity

  • in all lowercase.

  • This transition, though, means that you would only

  • consume one letter at a time.

  • So at the moment, this picture represents an email address

  • that only has a single letter in it.

  • And in fact I'm going to have to draw another dot here

  • to follow this pattern, so that I have two states here.

  • This is before the at sign.

  • This is after the at sign.

  • So if I draw another transition or edge here, that represents the at sign.

  • And then the end of my email address, let's say,

  • has another alphabetical letter, which is a through z.

  • So short of it is now, I have built a machine, or rather

  • a picture of a machine that says, you can

  • have any character that's alphabetical, then you have an at sign,

  • then you have another letter as your domain.

  • This is obviously incomplete, because A@A is not an email address,

  • at least as we've defined it thus far.

  • So we would need to start to enhance this picture a bit more.

  • So what would that actually mean?

  • Well if I want to support one or more a to z's, I need to enhance this picture.

  • I need to add another state to my machine.

  • And so I'm going to move the start of the machine

  • over here, which you can still now see over here.

  • I'm going to go ahead and have an edge going

  • to this state, which is a through z.

  • But then I'm going to have another transition that allows me to,

  • for instance, go back and forth on a to z as well.

  • And notice, this is a deliberate loop.

  • I can consume an a through z for my string.

  • And then if it's two letters, I can do it again.

  • If there's three letters, I can do it again.

  • Four letters, I can do it again.

  • And when I'm ready to read the at sign, I can then-- whoops.

  • Oh, whoops, whoops, whoops.

  • David messed up.

  • Sorry.

  • This is why we don't do things on the fly.

  • Here we have-- sorry.

  • We have the at sign there.

  • So here we have--

  • here we go.

  • I drew it in the wrong place.

  • My apologies.

  • I can consume a to z here.

  • And now let me go ahead and draw the original dot here, a through z here.

  • My apologies.

  • So I consume the first letter.

  • Then I can immediately consume the at sign from the user's input,

  • and then another letter, thereby putting me from user name, at sign to domain.

  • Or I can consume one letter for the user name,

  • consume another, another, another.

  • And now I have a user name that is one or more characters.

  • And so you see this beautiful mapping here.

  • This now represents, essentially, the block

  • that we described as dot plus earlier, if again, we're

  • keeping it simple with just letters of the alphabet.

  • So you have this direct mapping now between the syntax

  • we've been talking about and the machine, at least

  • pictorially that you might build.

  • And so implementing a parser for a regular expression

  • really amounts to implementing code that does this.

  • And you'll see some familiar constructs.

  • Obviously if you're doing something again and again, this connotes a loop.

  • And all of you know how to implement a loop probably,

  • using a for loop, a while loop, or maybe even

  • recursion to do something again and again.

  • So you could imagine writing code that just has different states

  • or constants that represent each of these states,

  • where one of your variables might mean, I am reading the user name.

  • Then another state that means, I have read the at sign.

  • Then a final state that means, I have read the domain name.

  • And as soon as you end up with a value in that variable that

  • represents the so-called end state, you have parsed an email address.

  • So that's like a whole, let's say week in cs theory.

  • But yes, implementing a parser with a regular expression

  • really boils down to just thinking about how

  • you model that regular expression using a certain syntax,

  • map it to a picture of a machine, and then

  • implement that machine in software.
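
A minimal sketch of the whiteboard machine in code, under the same simplifications: lowercase letters only, one or more letters before the at sign, and exactly one letter after it. The state names and the extra REJECT state are assumptions added so that the code has somewhere to go on bad input.

```python
import string
from enum import Enum, auto


class State(Enum):
    START = auto()     # nothing read yet
    USERNAME = auto()  # one or more letters read
    AT = auto()        # the at sign has been read
    DOMAIN = auto()    # final state: one letter read after the at sign
    REJECT = auto()    # dead state: no transition was available


LETTERS = set(string.ascii_lowercase)


def step(state: State, ch: str) -> State:
    """Follow one transition (edge) for one input character."""
    if state is State.START and ch in LETTERS:
        return State.USERNAME
    if state is State.USERNAME and ch in LETTERS:
        return State.USERNAME  # the loop: consume another letter
    if state is State.USERNAME and ch == "@":
        return State.AT
    if state is State.AT and ch in LETTERS:
        return State.DOMAIN
    return State.REJECT


def accepts(text: str) -> bool:
    """Run the machine over text; accept only if it ends in the final state."""
    state = State.START
    for ch in text:
        state = step(state, ch)
    return state is State.DOMAIN


for candidate in ["a@a", "malan@h", "@a", "malan@", "malan@ha"]:
    print(candidate, accepts(candidate))
```

This machine corresponds to the expression ^[a-z]+@[a-z]$, which is why "malan@ha" is rejected: there is no loop on the final state yet.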

  • COLTON OGDEN: Let us know if you want David to teach a theory course,

  • because I think that'd be pretty cool.

  • But yeah, that was a cool--

  • DAVID MALAN: I hope that wasn't too much of a tangent there.

  • But that stuff is really quite fun, and it really does bridge

  • the theory and the practical world.

  • COLTON OGDEN: No, no, that was great.

  • Some people had some comments.

  • They said, can you join the first and the second circles?

  • Instead of two just make one?

  • I think maybe make the first one a loop?

  • Or does that first need to be a separate node?

  • DAVID MALAN: Really good question.

  • And that's why I was struggling under pressure.

  • The first state needs to lead to another state,

  • because you have to consume at least one symbol.

  • And if we had put the loop on that first state,

  • we could accidentally never go around that loop,

  • immediately start with the at sign, and that's going

  • to give us an invalid email address.

  • COLTON OGDEN: Because it's looking for those, sort of, what are they called?

  • Transitions.

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: And the transition-- it can take those as paths--

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: --and execute on them.

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: We can continue the same mapping after the at state also

  • to get more than one a to z.

  • Yes.

  • And that's because for brevity, you drew it.

  • But you can have another loop at the end, after the at.

  • DAVID MALAN: Yes.

  • I didn't finish the story.

  • We'd need an actual loop or cycle to do multiple letters.

  • And, frankly, we'll want additional states

  • if we want to have a dot and then a TLD like dot edu or dot org or whatever.
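
A sketch of how the finished story might look as a table of transitions, with a loop after the at sign and extra states for the dot and the TLD. The state names, the table layout, and the function names are all assumptions; lowercase letters only, and no subdomains.

```python
import string

LETTERS = set(string.ascii_lowercase)


def kind(ch: str) -> str:
    """Collapse all lowercase letters into one input symbol, 'letter'."""
    return "letter" if ch in LETTERS else ch


# (current state, input symbol) -> next state
TRANSITIONS = {
    ("start", "letter"): "user",
    ("user", "letter"): "user",      # loop: more user-name letters
    ("user", "@"): "at",
    ("at", "letter"): "domain",
    ("domain", "letter"): "domain",  # loop: more domain letters
    ("domain", "."): "dot",
    ("dot", "letter"): "tld",
    ("tld", "letter"): "tld",        # loop: more TLD letters
}
ACCEPTING = {"tld"}


def accepts_email(text: str) -> bool:
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, kind(ch)))
        if state is None:  # no edge for this character: reject
            return False
    return state in ACCEPTING


for candidate in ["malan@harvard.edu", "a@a", "malan@harvard.", "malan@.edu"]:
    print(candidate, accepts_email(candidate))
```

This table corresponds to ^[a-z]+@[a-z]+\.[a-z]+$. Supporting a subdomain would mean adding an edge from the TLD state on a dot.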

  • COLTON OGDEN: Yeah, that's super cool.

  • It reminds me of--

  • is this a finite state machine as I think?

  • Yeah, because we use that, like in the games--

  • DAVID MALAN: Games, yeah, absolutely, all the time.

  • And honestly, if you're familiar-- they're getting a little dated

  • these days, but soda machines.

  • If you've walked up to a soda machine and put in coins, no matter

  • the country you're in, a soda machine is a finite state machine

  • or a deterministic finite automaton.

  • Deterministic in the sense that no matter how many times you put

  • in the right amount of money, it will behave exactly the same way,

  • assuming there's still soda left.

  • And you can think about a soda machine

  • as being similar. With regular expressions, each of those states

  • represented where you are in the string.

  • I have read a character.

  • I have read multiple characters.

  • I have read an at sign.

  • I have read the domain name.

  • Each of those circles on the board represented something conceptually.

  • A soda machine, in the US, for instance, is

  • going to have different states as well.

  • You can insert a dime, a nickel, and a quarter, but not pennies, for instance,

  • typically.

  • So there is probably a $0.05 state, there's a $0.10 state,

  • there's a $0.15 state, there's a $0.25 state, a $0.30 state,

  • but there's not a $0.31 or a $0.32 or a $0.33 state,

  • because every time you drop a coin in the machine,

  • it's as though the soda machine is following a transition,

  • following a transition.

  • And as soon as you get to the dollar state,

  • or however expensive the soda is, then the soda pops out.
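
The soda machine can be sketched the same way, with the state being the number of cents deposited so far. The one-dollar price, the names, and the choice to cap the total at the price (no change is given) are assumptions made to keep the sketch small.

```python
COINS = {"nickel": 5, "dime": 10, "quarter": 25}
PRICE = 100  # assumed price in cents

# Every reachable state is a multiple of five cents, so there is no 31-cent state.
STATES = list(range(0, PRICE + 5, 5))


def insert(state: int, coin: str) -> int:
    """Follow one transition: the state is the cents deposited so far."""
    if coin not in COINS:
        return state  # pennies and anything else are ignored
    return min(state + COINS[coin], PRICE)  # cap at the price; no change given


def run(coins) -> str:
    state = 0
    for coin in coins:
        state = insert(state, coin)
        if state == PRICE:
            return "soda"
    return f"{state} cents so far"


print(run(["quarter", "quarter", "quarter", "quarter"]))
print(run(["nickel", "dime"]))
print(run(["penny"]))
```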

  • COLTON OGDEN: Can you describe computer programs therefore

  • as being deterministic finite automata?

  • DAVID MALAN: Some programs, yes, if they are indeed

  • behaving completely deterministically.

  • If they're behaving non-deterministically--

  • COLTON OGDEN: That's true.

  • DAVID MALAN: --you might have some randomness.

  • But even randomness in computers is deterministic, at the end of the day.

  • So, short answer, yes.

  • COLTON OGDEN: OK.

  • That makes sense.

  • That makes sense.

  • Let's make sure we're OK-- keep with the chat here.

  • By the way, thank you to axXelus for following us.

  • I saw that pop up during the whiteboard session.

  • DAVID MALAN: Oh, I think we just signed ourselves up for us

  • a course on finite state machines.

  • COLTON OGDEN: I thought that was actually really cool.

  • I liked the whiteboard.

  • Next time we'll try to get the draw 50 integration into the setup.

  • DAVID MALAN: Indeed.

  • If you saw one of our previous streams with Colton and Dan Coffey,

  • we have the beautiful screen and web-based software that Dan wrote,

  • via which we can draw pictures as well.

  • Much better than old school.

  • COLTON OGDEN: And David has an amazing new--

  • what is this thing called?

  • DAVID MALAN: Oh, well we'll see here.

  • Just off screen is a beautiful new tablet

  • that we can draw on, which will allow us to draw pictures and diagrams much more

  • easily.

  • COLTON OGDEN: Yeah.

  • We'll try to get that set up for our next stream together.

  • Let me see where we are.

  • Yes, a theory course, hearts, says AllProgrammers.

  • Is there a term for this type of diagram?

  • Yes, a theory course by him would be fun because he's always

  • so practical regarding applications too and scaffolds nicely.

  • And I think you did say, a finite state machine

  • or a deterministic finite automaton was the name for that.

  • DAVID MALAN: Yep.

  • COLTON OGDEN: Yeah, we need that course, says Osman.

  • Anything taught with passion will be interesting, so

  • keep the lessons coming, says PresidentOfMars.

  • David is very good at that.

  • Wow, that's an old school way of doing it, says Bhavik Knight.

  • DAVID MALAN: Thank you.

  • COLTON OGDEN: Old school, and when we bring the tablet,

  • we'll bring the old school with the new school.

  • DAVID MALAN: True.

  • Though DFAs have been around for a long time too,

  • so maybe that's the old school way of doing it.

  • COLTON OGDEN: When did it first come out?

  • You think like the 50s or 40s?

  • DAVID MALAN: Yeah, around there.

  • It derives from math and discrete math.

  • COLTON OGDEN: Turing, yeah?

  • DAVID MALAN: Yep.

  • Mm hm.

  • COLTON OGDEN: In terms of design, since you

  • have to program the verification code too,

  • to check the email address is a real email,

  • is there any reason not to be somewhat loose and broad in this RegEx code,

  • such as using wild cards and hard coding each of the dot com, dot TV, et cetera?

  • DAVID MALAN: Short answer, yes.

  • Your odds are these days, you're not going to hard code all of the TLDs

  • because there's an atrocious number of them-- hundreds, probably,

  • maybe approaching a thousand or something crazy.

  • So yes, you're probably going to focus more on syntax

  • and not on the validity of those top level domains.

  • Because as someone alluded to earlier, you're probably--

  • and actually I think it might have been Twitch Hello World,

  • you yourself-- you're probably going to send the user a confirmation email.

  • And if the email bounces, because the domain doesn't exist,

  • you've got your answer, and you don't have to infinitely,

  • exhaustively check whether or not the email

  • address itself was a valid domain.

  • And if you get the user to click a link, confirming the existence,

  • then you're OK.

  • So, yes.

  • So you'd probably want to do a high pass at the email address,

  • using a regular expression like we are.

  • Or better yet-- and we'll end on this note too--

  • using a library that comes with Python or any number of other languages,

  • just to do that initial validation.

  • Because humans-- at least here on campus,

  • among Harvard University students, we can tell you that about 10% of them

  • mistype their email address, if we just ask them on a Google form

  • to type it in, unless we pre-populate it, which we do instead.
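
A loose first check of the kind described above, one that looks only at syntax and leaves the TLD unvalidated, might look like the following. The pattern and the name `looks_like_email` are assumptions for illustration, not the regex used on stream, and a confirmation email remains the real test of whether the address exists.

```python
import re

# Deliberately permissive: something, an at sign, something, a dot, something.
# No list of top-level domains is hard-coded.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(address):
    """Return True if the address passes a rough syntax check."""
    return LOOSE_EMAIL.fullmatch(address.strip()) is not None


for candidate in ["malan@harvard.edu", "malan@harvard", "malan@@harvard.edu"]:
    print(candidate, looks_like_email(candidate))
```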

  • COLTON OGDEN: Makes sense.

  • And I think if we did have a, sort of a limited list of TLDs to choose from--

  • for example, whatever happened back in, I think it was 2013 when they added

  • a bunch of new ones-- if that were to happen again, well,

  • then the code would break for--

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: --new registrants that use those TLDs.

  • DAVID MALAN: But there's a lot of websites

  • out there, especially if you're a student, where

  • you're in the habit of signing up for free stuff

  • because you have a dot edu account.

  • So if you're at a university or a high school that gives you a dot edu address

  • or something more local to your own country,

  • you might use a regular expression.

  • Just make sure that it's an actual student

  • eligible for free software or whatever, so they still have their value.

  • COLTON OGDEN: Right.

  • DAVID MALAN: Certainly.

  • COLTON OGDEN: I can make my own Colton Ogden dot edu and get some free stuff.

  • DAVID MALAN: There you go.

  • COLTON OGDEN: All programmers, can you possibly add DFAs and FSMs into CS50?

  • Maybe just an introduction?

  • There's probably not enough room in the course to integrate those, you think,

  • right?

  • DAVID MALAN: Realistically, no.

  • But that's why we have these live streams

  • and other forms of seminars and such.

  • COLTON OGDEN: I've personally wanted a follow on for a long time.

  • This would be a great--

  • I think RegEx and these together would make a great lecture though.

  • DAVID MALAN: Oh, thank you.

  • COLTON OGDEN: Yeah.

  • We should think about that a little bit maybe.

  • Can draw 50 print to PDF and add pages to it?

  • From what I understand, not PDF, but to PNG files, yes.

  • DAVID MALAN: They can, yeah.

  • COLTON OGDEN: Oh, does PDF work?

  • DAVID MALAN: Well, I mean, draw 50 is just a web-based application.

  • So you could literally go to your own browser's file print menu

  • and generate a PDF of it, which would actually work for you.

  • And speak of the devil!

  • CS50's own Dan Coffey from last week's stream

  • is here to take your feature request.

  • Dan, come on screen.

  • DAN COFFEY: Obviously, Dan Coffey.

  • DAVID MALAN: So, Dan, we were just asked, can draw 50 print to PDF

  • and add pages to it?

  • And if not, how quickly could you add that feature?

  • DAN COFFEY: So we could download a PNG at the moment.

  • We can easily do an SVG.

  • DAVID MALAN: Ooh.

  • DAN COFFEY: Scalable Vector Graphic.

  • DAVID MALAN: Thank you.

  • DAN COFFEY: I don't think it would be--

  • well.

  • I don't know.

  • Do you need a server side to convert to PDF?

  • DAVID MALAN: Yes, to generate the download trigger, yeah.

  • DAN COFFEY: Cause to generate the PNG download, we just change the headers.

  • DAVID MALAN: Yeah.

  • PDFs are annoying.

  • We might be able to do it.

  • But honestly, using the browser's built-in mechanism

  • is probably the simplest way.

  • COLTON OGDEN: Is something broken?

  • Is that why you came in?

  • DAN COFFEY: I heard that the tablet wasn't working for drawing.

  • DAVID MALAN: Oh, we just haven't connected it.

  • COLTON OGDEN: It's disconnected.

  • DAVID MALAN: That's OK.

  • COLTON OGDEN: I disconnected it to get the stream set up.

  • It had all the other stuff hooked into it.

  • Functional, I'm sure, but not plugged in.

  • DAVID MALAN: We're just here talking about regular expressions,

  • if you'd like to talk about your favorite features

  • of regular expressions-- character classes or?

  • DAN COFFEY: Reverse search is always fun.

  • DAVID MALAN: Oh, reverse search.

  • Nice.

  • Yeah.

  • COLTON OGDEN: I'm actually not too familiar with that one.

  • DAN COFFEY: I love using just multiple capture groups.

  • It's like the most--

  • to get everything you need in one search.

  • DAVID MALAN: What a perfect segue to capture groups.

  • COLTON OGDEN: Doing a segue for that.

  • DAVID MALAN: So yeah, it turns out that we've

  • been using these patterns thus far to just check whether or not

  • the string matches a pattern.

  • But sometimes you want to extract information from strings.

  • You could use split, as we started.

  • You could use substring.

  • But you could also use what Dan described

  • just a moment ago as capture groups.

  • And a little confusingly, they too tend to use parentheses,

  • but we'll distinguish exactly what's going on here as follows.

  • So how do we go about doing this?

  • So it turns out suppose that we wanted to ask the question,

  • are you from Harvard or are you from Yale,

  • and did you type in your email address?

  • Well let me rewind to a simpler RegEx, which is where we were before.

  • It turns out that every time we use parentheses in this way,

  • we are using what's called a capture group, where

  • we are telling the library, the re library in this case,

  • go ahead and capture those substrings for us.

  • So don't just match on them, but allow me to do

  • something interesting with them.

  • So we can't quite see this now, because we

  • are treating the return value of re search

  • as being a Boolean, which it's technically not.

  • It's actually going to return to us a list that's empty if there's no match,

  • or is non empty if there are matches.

  • So let me go ahead and do this, and say, matches gets re search.

  • And then the equivalent code would just be matches.

  • So I've not done anything new or interesting just yet.

  • Let me get rid of the colon at the end there.

  • But now I'm actually storing the return value.

  • Let's just poke around and see what Dan was

  • alluding to by printing out those actual matches,

  • actually only in the case of it being non-null.

  • COLTON OGDEN: That they exist, yeah.

  • DAVID MALAN: Exactly.

  • So we're going to see, thanks for the email, and then

  • the actual contents of the return value of re search.

  • Let me go ahead and save that, go over to our other terminal window,

  • and let's go ahead and do malan@harvard.edu, Enter.

  • And you see some interesting fanciness here.

  • And it's not obvious what's inside of that,

  • because it's actually a whole object, an object belonging

  • to a certain class called re match.
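
What the return value looks like can be sketched as follows. In Python 3, `re.search` returns a match object when the pattern is found and `None` otherwise, so testing it in an `if` behaves as described above. The pattern here is an assumption reconstructed from the discussion; the exact regex on screen is not in the transcript.

```python
import re

# Assumed pattern: an optional "cs50." subdomain, then harvard or yale, then .edu.
PATTERN = r"^[a-z0-9._-]+@(cs50\.)?(harvard|yale)\.edu$"


def check(email):
    """Return a message showing what re.search handed back."""
    matches = re.search(PATTERN, email)
    if matches:  # a match object is truthy, None is falsy
        return f"Thanks for the email: {matches!r}"
    return "No match"


print(check("malan@harvard.edu"))
print(check("malan@example.com"))
```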

  • But we can actually check this.

  • Let's go to Python--

  • Python 3, re search.

  • And let's see if we can't find ourselves to the capture groups.

  • Let me search for a capturing group.

  • Let's see, not on that page.

  • Let's do re capture group to find the right documentation,

  • just so folks can consult it later.

  • We use search because that searches any part of the string.

  • Oh, you don't see him, but Dan's still over here, everyone.

  • COLTON OGDEN: Also, shout out to Kareem in the chat.

  • Kareem Zidane.

  • DAVID MALAN: Nice to see you, Kareem.

  • So you'll see here a discussion in this link here,

  • which is on docs.python.org/3/howto/regex.html,

  • which is more of a discussion of how to use regular expressions.

  • That parentheses also indicate capture groups.

  • And we can actually use some functions that

  • come back as methods inside of that re match object that's returned.

  • So what does that actually mean?

  • I'm going to focus on using group as follows.

  • I'm going to go into my code again, and instead of just printing

  • that out, I'm going to go ahead and say, you know what, let's go ahead

  • and print out the first group that matches one indexed.

  • And let's see what happens.

  • I'm going to go ahead and save that, reload my program,

  • type in malan@harvard.edu, Enter.

  • And we see none came back there, which is interesting.

  • But that's OK.

  • Let's poke around a little further.

  • Let's look at the second group, not one but two.

  • Let's go ahead and run this again, malan@harvard.edu.

  • Interesting.

  • COLTON OGDEN: OK.

  • DAVID MALAN: So why do you think we've captured Harvard the second time

  • but nothing the first time?

  • COLTON OGDEN: What's the capture group over?

  • It looks like the CS50, right?

  • And it was a harvard.edu without a subdomain.

  • DAVID MALAN: Exactly.

  • COLTON OGDEN: So there was no subdomain.

  • DAVID MALAN: Right.

  • So because, in my regular expression I have two sets of parentheses, a.k.a.

  • capture groups as Dan called them, this one in parentheses

  • actually captures, either CS50 dot or nothing at all,

  • because the question mark can mean 0 or 1.

  • The second group of parentheses here captures Harvard or Yale.

  • Because both of the parentheses are there,

  • I'm going to get that group one and group two, it's just one of them

  • might actually be none if the CS50 dot is not actually present.

  • So I could now do something more conditionally.

  • I could do something like this.

  • If matches.group(2) equals equals harvard,

  • I could say something more precisely like, thanks for the Harvard email.

  • Else I could say something more like, print, thanks for the Yale email,

  • just thereby distinguishing the type of email address I got back.
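
The two capture groups and the conditional just described can be sketched like this. The pattern is again an assumed reconstruction rather than the exact regex from the stream, and `greet` is a hypothetical helper name.

```python
import re

# Assumed pattern with two capture groups:
#   group 1: the optional "cs50." subdomain (None when absent)
#   group 2: "harvard" or "yale"
PATTERN = r"^[a-z0-9._-]+@(cs50\.)?(harvard|yale)\.edu$"


def greet(email):
    """Pick a message based on which school the second group captured."""
    matches = re.search(PATTERN, email)
    if not matches:
        return "That is not a Harvard or Yale address"
    if matches.group(2) == "harvard":
        return "Thanks for the Harvard email"
    return "Thanks for the Yale email"


m = re.search(PATTERN, "malan@harvard.edu")
print(m.group(1), m.group(2))  # group 1 is None because there is no subdomain
print(greet("malan@cs50.harvard.edu"))
print(greet("malan@yale.edu"))
```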

  • COLTON OGDEN: And we can appreciate just how

  • much complexity, in terms of the iterative logic or the imperative logic

  • that we would have had to incur to get to this point here.

  • DAVID MALAN: Absolutely.

  • Indeed.

  • So let's go ahead and do this.

  • So let me go ahead and rerun this program once more, malan@harvard.edu.

  • Oh, thanks for the Harvard email.

  • But, notice, we're still supporting CS50.

  • Thanks for the harvard.edu email, but if I go to Yale's email address,

  • now I've distinguished these two.

  • So the capture group, as Dan referred to it

  • is a perfect name, because you're capturing some part of the substring.

  • You're being handed it back.

  • And in Python you can get at that value by using the group method.

  • COLTON OGDEN: And 0 is the whole string, right?

  • DAVID MALAN: 0's going to be the whole string, which is why I deliberately

  • started at one index.

  • It tends not to be that useful.

  • But it does ensure that if there's a match,

  • the list is going to be non-empty--

  • COLTON OGDEN: Sure.

  • DAVID MALAN: --which is handy.

  • So what if I didn't care about the CS50?

  • I was using the parentheses because I wanted them,

  • but I actually don't want to capture those specifically.

  • It turns out that we can actually tell Python, use these parentheses

  • for grouping, and to actually have or not have CS50 dot there,

  • but I don't necessarily have to specify if they're

  • going to be in the capture group.

  • COLTON OGDEN: Is this Python RegEx syntax specific in this case?

  • DAVID MALAN: It is.

  • So you can think of this as the ternary operator, where

  • in C and in other languages you can use a question mark, then a colon

  • to say if or else.

  • You can also use that as syntax--

  • it's crazy, ugly looking here, but what this is going to do is as follows.

  • Let me go ahead and save this.

  • Now let me change the group to 1, because it's now

  • going to allow me to use parentheses to say 0 or 1 instances of CS50 dot.

  • But it's not going to return them as a capture group.

  • So I'm using them syntactically, but not to capture as Dan proposed earlier.

  • So let me go ahead and rerun this, malan@harvard.edu.

  • And voila, still detecting.

  • And we're just not unnecessarily capturing stuff we don't want.
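
The non-capturing form uses `(?:...)` in place of plain parentheses. A short sketch, again with an assumed pattern, shows that the school moves from group 2 to group 1 while group 0 remains the whole match.

```python
import re

# (?:cs50\.)? still means "zero or one cs50.", but it is not captured.
PATTERN = r"^[a-z0-9._-]+@(?:cs50\.)?(harvard|yale)\.edu$"

m = re.search(PATTERN, "malan@cs50.harvard.edu")
print(m.group(0))  # the entire matched string
print(m.group(1))  # the only capture group left: the school
print(m.groups())  # all capture groups as a tuple
```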

  • But again, I can't emphasize enough--

  • I mean, this looks like a train wreck of syntax now.

  • It's just so confusing, certainly if you're new to regular expressions.

  • But the key is that we started, what, 90 minutes ago building up

  • with just looking for the at sign, then looking for the user name,

  • then looking for the TLD.

  • Really take these baby steps and make your use of RegExes really incremental.

  • COLTON OGDEN: Yeah.

  • And I think once you've looked at it a few times, [INAUDIBLE] a few of them.

  • Like this kind of stuff no longer really seems too intimidating.

  • DAVID MALAN: Yeah, absolutely.

  • COLTON OGDEN: Definitely compared to the monstrous block of code you would

  • need to do the same thing, right--

  • DAVID MALAN: Indeed.

  • COLTON OGDEN: --without it.

  • DAVID MALAN: Now it turns out, we can do this way more simply

  • by not doing any of this at all.

  • And if I Google, Python validate email address, you'll see,

  • as someone mentioned in the documentation a bit ago,

  • there's actually libraries that will allow you to do this quite simply.

  • So if you actually use pip or pip3 to install validate email,

  • you can really simplify your life by just saying this.

  • So you can ignore this entire conversation

  • about validating email addresses, for instance, and just use a library.

  • But you're assuming that that library is correct,

  • and hopefully it is, if it's open sourced and lots of people

  • have commented on it and provided feedback and pull requests,

  • to the repo.

  • But this is generally the way to do things,

  • not to reinvent the wheel yourself.

  • COLTON OGDEN: So the TLDR for today's stream is, download library for it?

  • DAVID MALAN: For email addresses, yes.

  • But regular expressions are so much more powerful, right.

  • Because suppose you just have a messy data set, right.

  • Humans are in the habit of typing their mailing addresses differently,

  • their phone numbers differently.

  • You can actually use the symbology that we introduced today

  • in regular expressions to get rid of, maybe all of the parentheses,

  • all of the dashes in a phone number, so that you're

  • left with just the decimal digits.

  • You can do this to clean up street addresses.

  • If someone typed in, 33 Oxford Street, Cambridge, Mass, 02138 all on one line,

  • you could use regular expressions to extract the state, maybe,

  • then hopefully the city, maybe the street address,

  • and with high probability maybe clean it up, ultimately too.
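
Both clean-up ideas can be sketched with the `re` module. The helper names and the sample phone number are hypothetical, and the address pattern assumes the comma-separated, one-line form quoted above; real addresses vary far more than this handles.

```python
import re


def digits_only(phone):
    """Strip everything that is not a decimal digit."""
    return re.sub(r"\D", "", phone)


# Assumes "street, city, state, zip" separated by commas, with a 5-digit ZIP.
ADDRESS = re.compile(
    r"^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*(?P<state>[^,]+),\s*(?P<zip>\d{5})$"
)


def parse_address(line):
    """Return the named parts of a one-line address, or None if it does not fit."""
    match = ADDRESS.search(line.strip())
    return match.groupdict() if match else None


print(digits_only("(617) 555-0100"))
print(parse_address("33 Oxford Street, Cambridge, Mass, 02138"))
```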

  • COLTON OGDEN: That one sounds like it'd be a little rough.

  • DAVID MALAN: It is.

  • It is.

  • It's better to ask the user from the get-go, what's your street

  • address, what's your city and state?

  • COLTON OGDEN: That's probably why they do it in separate fields in most forms.

  • DAVID MALAN: Absolutely.

  • But these days too, if you need to clean up data, which is not uncommon,

  • if you're inheriting a data set, if you're doing something data science-y,

  • if you've just got a messy data set from another company or colleague,

  • you can clean it up using regular expressions,

  • by just matching or massaging the data the way you want it to be.

  • COLTON OGDEN: Awesome.

  • DAVID MALAN: Should we see if there's any other questions?

  • COLTON OGDEN: Yeah.

  • Yeah, yeah, yeah.

  • That was awesome.

  • Also, shout out to Brian Rodriguez for the follow.

  • Thank you very much.

  • DAVID MALAN: What is it?

  • Sure.

  • Yeah.

  • COLTON OGDEN: I think All Programmers is asking,

  • can you talk a bit about the complexity and average time

  • complexity of using the RegExes?

  • DAVID MALAN: Oh, the time complexity-- you

  • can come up with very perverse regular expressions that

  • are incredibly expensive to use.

  • I am not savvy enough to be able to cite a few such examples offhand,

  • to be honest, because it's been some time since I had to think about this.

  • For the most part you don't have to worry about this, at least

  • for a reasonable length of regular expressions like the ones

  • we have been doing here.

  • Honestly, rule of thumb is, if it kind of fits on the screen--

  • and my font size is pretty big-- you're probably fine.

  • It's only when you start using lots of capture groups, lots

  • of look ahead, which is not a topic we've looked at today where things can

  • get computationally more expensive, because at some point you introduce

  • a bit of non-determinism and you need to figure out what transition to follow,

  • because the state machine you've implied with your regular expression

  • just becomes a lot harder to execute.
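
One commonly cited example of an expensive pattern, not one given on stream, is a nested quantifier such as `(a+)+` followed by something that forces a failure. With a backtracking engine like Python's `re`, the time to report "no match" typically grows very quickly with input length. The sketch below keeps the inputs short so it finishes fast; timings vary by machine, so none are shown.

```python
import re
import time

# Nested quantifiers: there are many ways to split a run of "a" characters
# between the inner and outer "+", and a failed match may try them all.
EXPENSIVE = re.compile(r"^(a+)+$")

for n in (10, 14, 18):
    text = "a" * n + "b"  # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    result = EXPENSIVE.search(text)
    elapsed = time.perf_counter() - start
    print(n, result, f"{elapsed:.4f}s")
```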

  • COLTON OGDEN: Awesome.

  • Parset was saying, you can use Python3 dash i,

  • which will leave the Python process live so you can continue testing.

  • DAVID MALAN: True.

  • COLTON OGDEN: Group one's a whole group.

  • Group two is-- the open parentheses, says Bhavik Knight.

  • Yeah, I think that's what we've mentioned earlier.

  • Brian Rodriguez says, a question for both Colton and Professor Malan.

  • I'm in the process of making a video course for people at work

  • to take independently.

  • Since both of you have done this with great success,

  • do you have any tips or advice?

  • DAVID MALAN: Hmm.

  • Tips or advice--

  • I think you want to make sure you know your audience.

  • And you don't want to teach or introduce the material at such a high level

  • that your more technical colleagues are kind of bored with it.

  • I think you want to be careful not to speak at too sophisticated a level

  • technically that your less comfortable colleagues are sort of lost by it.

  • So I would try to find that balance.

  • And a technique we have adopted, at least here on campus

  • is to have material and problems and questions

  • for those less comfortable and more comfortable.

  • So you introduce the sort of standard set of material,

  • but you allow the more comfortable colleagues

  • to dive in deeper, and the less comfortable people

  • to remain comfortable with whatever questions

  • or exercises you're actually challenging them with.

  • COLTON OGDEN: The scaffolding that someone

  • alluded to previously in the chat.

  • DAVID MALAN: So that's another good one too.

  • And hopefully this came across with what we were doing today.

  • We've got a fairly sophisticated regular expression on the screen

  • now, a bunch of conditional logic.

  • We didn't start with that.

  • The very first lines we wrote today were calling

  • input, storing it in a variable, and just printing it out.

  • And so that was an example, albeit a short one of scaffolding.

  • Start here and then go here and here and here and here.

  • And hopefully if your audience is following along the way,

  • they end up on the top floor, even though you

  • started with them at the base.

  • COLTON OGDEN: Exactly.

  • Exactly.

  • Totally agree.

  • Lots of hair gel, says Asley.

  • He's going to start an extreme.

  • OK.

  • Fatma, would RegEx be used for interpreting regular messages?

  • How Google scans our emails, et cetera?

  • So I guess, for parsing email--

  • the bodies of emails?

  • DAVID MALAN: Yeah, absolutely.

  • I mean it depends what you mean by this, but if Google or other companies

  • are searching for keywords, they could be using regular expressions.

  • They might just be using simple string matching,

  • but it's probably implemented in regular expressions,

  • though they do it on such volume that they

  • might need to be fancier than just using,

  • say, Python RegExes for performance's sake.

  • COLTON OGDEN: Spam filtering type of thing.

  • DAVID MALAN: Spam, yeah, that's another good one.

  • Yeah.

  • COLTON OGDEN: I was making a simple compiler for a course of mine

  • and I used RegExes for removing the comments.

  • DAVID MALAN: Yeah, that's a good one too.

  • And actually we had some code years ago where, in CS50 I

  • tend to write examples for lecture that have comments.

  • Unfortunately if I show those examples in class,

  • it kind of spoils the questions I'm asking.

  • Because if I ask students, all right, what does this line of code

  • do, the problem is, if the comment is right there,

  • they don't have to think too hard about it.

  • So I used to run a command that used a regular expression to get rid

  • of all of the comments, just as you proposed right before class.
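
A comment-stripping command of that kind could be approximated as below for C-style source. This is a naive sketch and not the actual command used for lecture: the function name is hypothetical, and it would also remove `//` or `/*` that appear inside string literals, such as URLs.

```python
import re


def strip_c_comments(source):
    """Remove /* ... */ and // ... comments from C-like source (naively)."""
    # Block comments first; DOTALL lets "." span newlines, "*?" is non-greedy.
    without_blocks = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    # Then line comments; MULTILINE makes "$" match at the end of each line.
    return re.sub(r"//.*$", "", without_blocks, flags=re.MULTILINE)


sample = "int x = 1; // counter\n/* block\ncomment */\nint y = 2;\n"
print(strip_c_comments(sample))
```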

  • COLTON OGDEN: I did that on accident for a lecture for games during the summer.

  • DAVID MALAN: The day before you lost all of your comments?

  • COLTON OGDEN: It was kind of a point of fact-- yeah,

  • well, thereby I was a little bit more careful with how

  • clear my comments were.

  • DAVID MALAN: Good thing there's version control,

  • which brings us to the Git stream too.

  • COLTON OGDEN: Oh, yeah, yeah.

  • Kareem Zidane and I did--

  • Kareem hosted or led the GitHub stream that we did.

  • And also, I missed it, Brutus Harvenius, thank you for the follow.

  • Is there a way of capturing multiple occurrences

  • of substrings that match a string?

  • DAVID MALAN: Multiple occurrences of substrings--

  • COLTON OGDEN: I'm guessing if you have a line of text that's

  • like, hello, hello, hello, hello, hello, capturing the same one multiple times?

  • DAVID MALAN: Yes.

  • Usually you have to use another function.

  • And I don't know offhand, so I'm checking Stack Overflow here.

  • COLTON OGDEN: Here we go.

  • This is how real programming is done, everybody.

  • DAVID MALAN: So, yeah.

  • It looks like the re library has find all, which does exactly

  • as I think you're describing.

  • COLTON OGDEN: I've had to use that function for something before.

  • DAVID MALAN: Yeah.

  • Now that I see it, I think I have too.

  • But when in doubt, Google and see what comes back.

  • But now that you have the right mental model for what these are,

  • you'll find that the syntax is going to be pretty much on point

  • to today's discussion.
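
The function mentioned here is `re.findall`, and its sibling `re.finditer` yields match objects when positions are needed too. A minimal sketch using the "hello" line from the question:

```python
import re

text = "hello, hello, hello, hello, hello"

# findall returns every non-overlapping match as a list of strings.
all_matches = re.findall(r"hello", text)
print(all_matches)

# finditer yields match objects, which also carry the position of each match.
positions = [m.start() for m in re.finditer(r"hello", text)]
print(positions)
```

When the pattern contains capture groups, `re.findall` returns the captured groups instead of the whole match, which is worth keeping in mind when combining it with the earlier examples.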

  • COLTON OGDEN: Yeah, exactly.

  • Yale rhymes with email.

  • Harvard does not.

  • Further proof that Yale is better than Harvard, says Blue Booger.

  • DAVID MALAN: OK, we'll let them know.

  • COLTON OGDEN: I think you might have seen this earlier.

  • You mentioned randomness, and I inquired if it is accurate

  • that true short randomness can be incorporated in cs by detection

  • of the noise at the time?

  • DAVID MALAN: So short answer, yes.

  • This is the closest approximation that computers

  • tend to have these days for true randomness, where

  • you take an ambient sound, temperature, movement-- things that really are not

  • tied to something very deterministic like a computer's clock.

  • Even then there's certainly periodicity in, like, the wind, I presume.

  • I am no expert on wind, but things like that.

  • So just taking ambient noise and environmental data

  • might not necessarily be truly random.

  • The best we typically can do with computers

  • is find a distribution of information that appears to be random.

  • But physical inputs are the closest we can get.
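
In Python, the usual way to ask for randomness that is not driven by a program-chosen seed is the standard `secrets` module, which draws on the operating system's randomness source. How a given operating system gathers that entropy is outside this sketch, and the variable names are only illustrative.

```python
import secrets

# 16 random bytes from the operating system, shown as 32 hexadecimal characters.
token = secrets.token_hex(16)

# A die roll that does not come from a seeded, repeatable generator.
roll = secrets.randbelow(6) + 1

print(token, roll)
```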

  • COLTON OGDEN: Cool.

  • Makes sense.

  • HomeLine says, that is still a seed to the PRNG, the pseudo random number

  • generator.

  • DAVID MALAN: Mm hm.

  • Yep.

  • COLTON OGDEN: Never comment in your code-- problem solved.

  • DAVID MALAN: There you go.

  • COLTON OGDEN: All right.

  • I think that's the end of the comments.

  • I don't know if you'll be able to stick around for a couple of minutes,

  • just to get some last questions.

  • But that was an awesome, awesome tutorial on regular expressions.

  • DAVID MALAN: Yeah, I think Dan--

  • CS50's own Dan Coffey is about to pop back in with some tips for next steps.

  • So that if you'd like to practice with this and learn more,

  • you have some tools to use.

  • DAN COFFEY: I just wanted to share one tool

  • that I found game-changing when I was exploring regular expressions.

  • And if you want to just Google, RegEx tester?

  • I think it's RegEx pal or RegEx 101.

  • COLTON OGDEN: It's RegEx pal, right.

  • DAN COFFEY: Yep.

  • Either one.

  • The first one or either one is great.

  • And so if you want to copy-paste your regular expression into here.

  • COLTON OGDEN: The return one might not work if this is Python-specific.

  • DAVID MALAN: No, it's not.

  • COLTON OGDEN: Oh, OK.

  • DAN COFFEY: And then you can put all the cases you

  • want to try to test against down here.

  • So you can do a bunch of emails.

  • COLTON OGDEN: Ooh.

  • And it shows you capture group.

  • DAN COFFEY: It shows you the capture group it's being captured in--

  • COLTON OGDEN: That's amazing.

  • DAN COFFEY: --the match.

  • And it also will explain on the right, what the actual breakdown

  • in the top right here explanation--

  • DAVID MALAN: That is awesome.

  • DAN COFFEY: --which is under the chat.

  • COLTON OGDEN: Why is the second one-- oh,

  • is it just different color every other line?

  • The blue?

  • DAN COFFEY: Yeah.

  • COLTON OGDEN: OK.

  • DAN COFFEY: Because you've got the global modifiers on at the moment.

  • So.

  • DAVID MALAN: This is awesome.

  • So here, let me zoom in on the address for everyone online.

  • This is regex101.com, brought to you today by CS50.

  • DAN COFFEY: But it's very helpful if, like,

  • instead of having to constantly keep testing in your terminal window--

  • DAVID MALAN: Yeah.

  • DAN COFFEY: --but quickly to see what matches.

  • This was helpful for me.

  • DAVID MALAN: Yeah.

  • And you'll see here, we're actually using the PHP flavor of this.

  • We can switch to Python, though you shouldn't find really

  • any differences versus what we did.

  • You do see here subtly, the raw string that we

  • alluded to earlier, that just ensures that certain characters don't trip you

  • up when they're not escaped.
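
The effect of the raw string prefix can be seen in a short sketch. Without the `r`, Python interprets some backslash sequences itself before the `re` module ever sees the pattern; the sample text here is made up.

```python
import re

# In a normal string "\n" is one newline character; in a raw string it is
# two characters, a backslash followed by the letter n.
print(len("\n"), len(r"\n"))

# A raw string saves doubling the backslash.
print(r"\." == "\\.")

text = "a cat sat"
plain = re.search("\bcat\b", text)  # "\b" here is a backspace character
raw = re.search(r"\bcat\b", text)   # r"\b" reaches re as a word boundary
print(plain, raw)
```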

  • And you can see JavaScript and Go also have implementations here too.

  • COLTON OGDEN: That's awesome.

  • That's a great tool.

  • DAVID MALAN: Yeah.

  • Thank you so much, Dan Coffey.

  • DAN COFFEY: No problem.

  • DAVID MALAN: Any final questions in the stream here?

  • COLTON OGDEN: We got--

  • Irene says, thank you, David.

  • RegExes are brilliant but always daunting.

  • Building them up little by little makes a lot of sense

  • and makes them much clearer.

  • DAVID MALAN: Absolutely.

  • I think that's by far the biggest takeaway.

  • We only scratched the surface of some of the functionality.

  • Though frankly I think we probably hit some of the most useful features,

  • most commonly used.

  • So that should be a pretty powerful technique.

  • COLTON OGDEN: Dan's the hero.

  • DAVID MALAN: Thanks, Brenda, for all the effort we put into it here.

  • COLTON OGDEN: I used RegEx for my CS50 final project,

  • but now I finally understand what I Googled up from Stack Overflow.

  • DAVID MALAN: Nice.

  • Glad to hear it from MKloppenburg.

  • COLTON OGDEN: Yeah, that was an awesome tutorial.

  • Twitch Hello World-- since you're teaching an HLS course, is it just me

  • or is RegEx really similar to the old school

  • Lexis and Westlaw search terms using star, bang, et cetera, and how you

  • think of the terms to search?

  • DAVID MALAN: You know, I don't know if there's

  • a connection between tools like that.

  • Something tells me, no, maybe, though star has historically

  • tended to mean the wild card character.

  • Exclamation point has less of a history, I think.

  • So I don't know.

  • I would honestly pull up the Wikipedia article myself on both of those

  • to see what their etymology is of their syntax.

  • COLTON OGDEN: Fetzenrndy has followed.

  • Thank you very much.

  • David, would you use Python for RegEx nowadays or would you ever

  • go to Perl nowadays, says Andre.

  • DAVID MALAN: Personally, no.

  • I mean, PHP and Python essentially inherited Perl syntax

  • for regular expressions, I believe.

  • Don't quote me on that, but I'm pretty sure that's some of the etymology

  • there.

  • And PC-- I think that's even PCRE.

  • What does that stand for again?

  • Perl compatible regular expressions, and I

  • think Python, too, essentially adopted the same syntax maybe

  • with slight differences.

  • Perl was actually the first interpreted language that I learned years ago.

  • It's the first language I had to learn web programming.

  • Frankly, it's fallen out of favor.

  • People still use it.

  • Scripts still exist in it.

  • It's not a language I would typically reach for.

  • Frankly I think it's very easy in Perl to write code

  • that you yourself don't understand the next day or weeks later.

  • I think PHP and Ruby and-- well, PHP and Python

  • have done a better job at readability.

  • Ruby is perhaps a little reminiscent in my mind of Perl in its syntax.

  • So personally, nah.

  • I wouldn't really pick up Perl.

  • You can use its Perl compatible regular expressions in bunches of languages.

  • COLTON OGDEN: Are RegEx still considered slow as in performance, says HomeLine?

  • DAVID MALAN: It depends on how complicated they are.

  • Someone in the chat alluded to look ahead earlier.

  • There are ways to over-engineer them, such that they are so complicated

  • that you do essentially introduce non-determinism.

  • The computer has to try this branch or this branch.

  • However, any non-deterministic machine can be

  • converted to a deterministic machine.

  • The problem is, you might get exponential blow up in just

  • the complexity of it, and therefore the runtime.

  • So short answer, yes.

  • But honestly, unless you are using regular expressions to manipulate

  • or pattern match against huge data sets or some data set,

  • again and again and again in a loop, or many, many, many, many times,

  • don't worry about it.

  • Use the regular expression until you find

  • it to be an actual performance problem.
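
The branching described above can be sketched in Python, whose `re` module uses a backtracking matcher. The two patterns and the helper `time_match` are assumptions chosen for illustration and were not shown in the stream. Both patterns accept the same strings, but the nested quantifier gives the engine many ways to split a run of `a` characters, so a failing input typically takes far longer to reject.

```python
import re
import time

# Nested quantifiers: many ways to divide a run of a's between the inner
# and outer +, all of which are tried before the match fails.
risky = re.compile(r"^(a+)+$")
# Accepts the same strings, with only one way to consume each a.
simple = re.compile(r"^a+$")


def time_match(pattern: re.Pattern, text: str) -> float:
    """Return the seconds taken by one match attempt on text."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


# A run of a's followed by a character that forces the match to fail.
text = "a" * 20 + "b"
print(f"risky:  {time_match(risky, text):.4f}s")
print(f"simple: {time_match(simple, text):.4f}s")
```

Lengthening the run of `a` characters in `text` makes the gap grow quickly, so it is worth increasing it only a little at a time. For ordinary patterns on modest inputs the difference is rarely noticeable, which matches the advice to measure before optimizing.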

  • COLTON OGDEN: Xomoo, thank you for the follow.

  • Let's go back up here in the chat.

  • It looks like it's not visible.

  • Thank you for the stream, with a happy crying face

  • says All Programmers, like a crying face with joy.

  • DAVID MALAN: Nice.

  • COLTON OGDEN: Sort of.

  • We got another-- oh, that was Xomoo on there.

  • Bella says, thank you.

  • Thank you, Bella.

  • Asley says, Dan's appearance.

  • Thank you so much, David and Colton--

  • Dan.

  • This was very informative and easy to follow.

  • DAVID MALAN: Nice.

  • COLTON OGDEN: Thank you for joining us.

  • Very good intro to RegEx, says Bhavik.

  • Thanks, David and Colton.

  • Thank you.

  • Brenda-- this has been great.

  • My RegEx knowledge has now grown a lot.

  • Thanks, David and Colton.

  • DAVID MALAN: Nice.

  • COLTON OGDEN: Osman, thank you so much.

  • I took CS50x in 2013.

  • Until now I can't get enough of CS50.

  • DAVID MALAN: Nice.

  • COLTON OGDEN: And I think Brenda also said she took CS50x in 2012.

  • She says, we're oldies.

  • I think I first looked at it in 2010.

  • So--

  • DAVID MALAN: Oh.

  • Brenda, he just one-upped you.

  • COLTON OGDEN: That puts me up there.

  • DAVID MALAN: I think Dan--

  • Dan Coffey also took it in 2012?

  • DAN COFFEY: 2010.

  • DAVID MALAN: '10, damn it.

  • That's right.

  • Dan was my colleague in 2012.

  • COLTON OGDEN: If someone very experienced in AI

  • suggested that I not wait to learn it and instead seek and take

  • funding for an app, I think you could eventually code a simple MVP,

  • although not the full iteration.

  • I don't know how to code.

  • Yeah, do you agree with Paul Graham that it is not

  • good-- do you agree with Paul Graham that it is not a good idea

  • to start a tech startup if you don't code well enough to select and know

  • if your tech programmers are coding well?

  • DAVID MALAN: I think there are many examples of folks

  • who have started companies who don't necessarily

  • know how to code well themselves.

  • COLTON OGDEN: Apple.

  • DAVID MALAN: I mean, Apple and even Bill Gates

  • very quickly stopped writing code, shortly after founding Microsoft

  • is my understanding.

  • So while I do think there is some guidance

  • to be taken from comments and sentiments like that, where the reality is,

  • you will have a leg up if you could just better

  • understand what your team members are doing

  • and what your colleagues are doing.

  • You can hold your own in a conversation.

  • You can participate in the conversation.

  • You can provide inputs and provide better direction.

  • I think different people have different skills,

  • and you certainly shouldn't not do something,

  • just because you think you're not as strong as someone else.

  • COLTON OGDEN: No hard, fast rules.

  • Just be sensible, basically?

  • DAVID MALAN: Yeah.

  • Indeed.

  • COLTON OGDEN: OK.

  • Now it looks like we've caught up with all the comments.

  • DAVID MALAN: Thank you so much, everyone.

  • COLTON OGDEN: Thanks so much, everybody.

  • Thanks to David for his awesome RegEx tutorial.

  • DAVID MALAN: Thanks to Dan for his pop-ins today.

  • COLTON OGDEN: Thank you, Dan, yeah, for his first contribution.

  • Tomorrow we have a super secret stream that we're not going to spoil.

  • DAVID MALAN: Oh, yeah.

  • I hear good things about this one.

  • COLTON OGDEN: This one's going to be great.

  • Tune in for that one.

  • David and I will be here for that one tomorrow at 1:00 PM.

  • DAVID MALAN: 1:00 PM Eastern time.

  • COLTON OGDEN: So, yeah, no spoilers.

  • 1:00 PM Eastern Standard time.

  • So thanks again, everybody, so much.

  • Final word, closing word?

  • DAVID MALAN: This was CS50 on Twitch.
