Subtitles section Play video Print subtitles Today we're going to do some more live coding, and we're going to talk about something which is quite close to my heart, because I wrote some of the early papers on it many years ago, and that's something called functional parsing, or combinator parsing. Before we get started with actual live coding, I want to begin with the question. And the question is: what actually is a parser? So for me a parser is a program which takes a string of characters as its input, and as its output produces some form of tree. And the idea is that the tree makes explicit the structure in the input string. So that's a bit of a mouthful, so let's have a simple example to explain what's going on with this. So what we've got here is we've got a string of five characters -- 2 plus 3 times 4 -- and that's our input string. But when we look at this we know that that's not just a random sequence of five characters, it's actually got some structure to it. So in particular, we've got three numbers here -- 2, 3 and 4 -- and we've got two arithmetic operators -- we've got the plus and the times -- and one of the things we learn at school is that the times happens before the plus. So you really do 3 times 4 here, and then you add the 2 on at the end. So that's our input string. So what a parser does, is it takes a string of characters -- like that -- and it tries to recognize the structure in the form of a tree. And the tree we would get from a parser is something like this here. And you see we've got some leaves in the tree -- the leaves are the numbers 2, 3 and 4 -- and the nodes in the tree are the two arithmetic operators -- that's the plus symbol and the times symbol -- and you can see the structure of the tree reflects the fact that we do the multiplication. First we multiply 3 by 4, and then we do the addition second. So that's the basic idea of what a parser is: it takes as input a string of characters, and as output it produces a tree. So it's really a function -- it's taking as input a string and as output as a tree. So we've seen what a parser is. It's a function that takes a string as an input and produces a tree as an output. So now we want to think how can we actually implement this: how can we implement the idea of a parser? And I'm going to do this in Haskell today, but it doesn't matter if you don't know anything about Haskell because I'll be explaining everything as I go along. And actually, nothing I'm going to show you today is specific to a language like Haskell. You can do it in any general-purpose programming language. So if you do a web search for a 'combinator parsing' or 'functional parsing', and then whichever language you're interested in, you'll find the same kind of stuff which I'm going to show you today. So it's not specific to the Haskell setting. So what we're going to do then first, is we're going to define precisely, in our programming language, what it means to be a parser. We're going to define a new type called Parser, and it's simply going to be a function which takes a string as an input and produces a tree as an output. So it captures the very simple idea of what we are wanting to do -- a parser is a function that takes a string as an input, and gives the tree as an output, and the arrow here just means that we have a function from one thing into the other. So this captures the basic idea of what a parser is. But unfortunately, it's not sufficient to program with. We need to refine this a little bit to actually write programs with this. So the first little refinement we're going to make is that a parser might not consume all of its input. So for example, if we're trying to parse a number, like 2, we maybe we find the 2, and then maybe we've got some more stuff left in the input string that we need to parse. And if we want to chain parsers together, we're going to need to have access to the remaining input string that we didn't manage to consume. So the first little refinement that we'll make is rather than just returning a single tree, we're actually going to return a pair now. We're going to return two things -- we're going to return a tree as before, and we're also going to return the unconsumed part of the input string. So that's the first refinement we've made. The second refinement we're going to make is that a parser may not always succeed. We may be trying to parse a number, and we don't find the number, we find something else. So we need to have a way of representing that a parser can fail. So the way we're going to do that, is we're actually going to make a parser return a list of results, rather than a single result. So lists in Haskell are denoted using square brackets. So this simply means that rather than returning one pair of results, we could return zero, or one, or two, or as many as we like. And the idea is going to be if our parser can't parse, so it doesn't succeed, we'll return an empty list of results. And if it does succeed, we'll return a list with one pair -- we'll return a tree, that represents the structure of the input string, and we'll return the unconsumed part of the input. But because we're working with lists here, we could actually be more flexible. We could return two, three, or four, or five, or as many as we like parses. And this is actually quite a good flexibility, because for some languages the input string may be ambiguous -- maybe we're trying to parse English, and English sentences don't always have one parse, they can be interpreted in many ways. So this type here is giving us a flexibility to return many results if we wanted. We're not actually going to use that flexibility today, but it's nice to actually have it. So I haven't told you what the tree data type is here. And that's because I'm actually going to get rid of that now. Sometimes you may want to return a number, or a program, or some kind of other structure. So we're going to replace that specific type of trees by some arbitrary type 'a'. And I'll make this a parameter of my type. This is our final type, which we're going to work with today. What we're saying is that a parser whose results have type 'a' is simply a function which takes a string as an input, and then it gives a list of results. And each result is a pair comprising a single value of type 'a' -- maybe a tree, maybe a number, maybe something else -- and then an unconsumed part of the input string. Okay, so this is our final type. And if you look in any of the articles, or books, about these kind of parser combinators, or functional parsers, you'll find a type very similar to this there. It's quite a mouthful again, so let's think about how we could understand this in a simpler way. And actually, we can write a little rhyme to understand what's going on with this type. So let me write the rhyme out for you as a comment. So what we can say is: a parser for things, is a function from strings, to lists of pairs, of things and strings. This is a little Dr. Seuss rhyme to tell you what a parser is, or what a functional parser is. And that's actually how I remember this type. So that's our basic type now. We've seen so far, a parser is basically a function from strings to trees, but in order to actually program with these kind of things, we need to refine the type a little bit. And this is the type we'll be working with today. But you don't need to worry about the details of it. Just basically think it's a function from strings into trees, or some other kind of structure. What we're going to do now, is we're going to load up the parsing library. So I'll start up the compiler. This is a library, which contains a whole bunch of parsing stuff, which allows us to program with parsers of this form. And this is a parsing library I wrote myself. I'll see if I can get Sean to upload it as part of the video. A parsing library comes with any programming language you can think off. And again, if you just search for parser combinators, functional parsing, whichever language you like, you'll find the library, which gives you all sorts of basic ways of building parsers. And that's what I'm going to show you now. All of these libraries work in the same way. So you have some basic primitives, or basic building blocks, for parsers. And then you have a way of combining parsers to build bigger parsers. So it's like a kind of Lego kit, or a construction kit. You have some basic bricks that you can do things with. And then you can put those bricks, or components, or primitives, together in all sorts of different ways. So I'm going to show you a few of the primitives, and a few of the combining forms. And then we'll do an example. The first primitive I want to show you is very simple. It's just a way of parsing a single digit. So what the digit parser does, is it takes a string of characters, and it tries to consume a single numeric digit off the start of that string. And you might think, well, what does 'parse' do here? What parse does, is it takes a parser, which in this case is just digit, and it takes an input string to that parser, and it just applies one to the other. So of course, this parser here is going to succeed, because we do have a digit at the start of the input string. So we get exactly the expected result. We get a list with one pair. And the first thing in the pair is the actual digit. We get the character '1'. And the second thing in the pair is the unconsumed part of the input. And that's something we could then try and parse subsequently with another parser. We can test, does this thing fail properly? So, if I give it an input string that doesn't have a single digit at the start, then it's going to fail. So if I give it the input string "abc", there's no digit at the beginning, so we're just going to get the empty list of results. So I'll show you one more quick primitive. If I parse a single character, say an 'a', from that string, then that will do the right thing. If I parse single character 'a', and I didn't have an 'a' at the beginning, then it will fail. So we've seen two basic parsing primitives here -- we've seen a way of parsing a digit, a way of parsing a specific character, and we've seen that these things can succeed or fail. And in the library, or in any of these libraries, there'll be a bunch of these basic building blocks, or basic bricks, or primitives, that you can use to build up your parsers. Where things get more interesting is when you think how do you combine these kind of things, how do you use these basic bricks to build actually useful parsers? Let me show you an example of this. So there's a parsing combining form called 'some'. And what it does is it takes a parser as its input, and it tries to apply it one or more times, as many times as possible. So if we're trying to parse 'some' digits, what we're trying to do is consume one digit, then two, then three, and as many as we can until we don't find any more digits. So if we apply the 'some digit' parser to the string "123" then it will do the expected thing. It will consume all three of the digits, and then we'll get the empty string left here. So that's 'some', it gives us a form of repetition. And we also have a very simple way of making a choice as well. So if I want to make a choice here, between a digit, and a letter. And let me parse the string "abc123". So, what we've got here, is this funny symbol here -- with the three symbols -- that's a choice operator. It says do that, or do that. So if I try to parse a digit, or a letter, what it's going to try first is it will take the first character in the input string, and say is it a digit? If so, I'll parse it. And if it's not a digit, then I'll go over to the other side, and say is it a letter? And I will try and parse that. You can see what's happening with this particular example here -- if I look at the first character, it's not a digit, it's a letter. So when I apply the digit parser it would fail, and then the 'or' operator, or the choice operator here, will go over to the other side and say, well, is it a letter instead? And of course it is, so we can parse the single 'a' off the front here, and then we get everything else as unconsumed. And of course, if we wanted to be a bit more clever we could combine some of these things. So I could say, some digit or letter -- get my brackets right -- "abc123", and then that will parse everything. Because all I'm doing here is I'm repeating, or iterating, the choice between either parsing a single digit, or a single letter. And I've got a string of digits and letters here, so I can parse the whole thing. So I've consumed them all. And then I get nothing left at the end. So what we've seen so far, is some basic building blocks, and we've seen a couple of combining forms -- we've seen a way of doing repetition, which is 'some', and we've seen a way of making a choice, which is the funny operator in the middle there. What I haven't shown you so far, is how to do some form of sequencing. And this is the most common thing you typically want to do with parsers. You want to say do this, and then do that, and maybe do that as well. You want to sequence things together. So I'll actually show you a bit of the parsing library here. So here's the parsing library. And I don't want to go through all the details of this, but one thing I want to know, is it's quite short. If I kind of scroll down here, I think it's about four and a half screen fulls. And I've got quite a big font here, and this is actually already quite a sophisticated parsing library. So it shows you the power of this method, that you don't need hundreds of lines of code to write parsers -- four and a half screen fulls is a library which is fully fledged and you can basically implement any parser that you like using this. So I'll show you a couple of examples of sequencing. The first example I want to show you is a parser for natural numbers. So what's a natural number? It's just a non-negative integer, like 0, 1, 2, 3, or 10, or something like that. So you think how do you parse a natural number? Well I'm going to use the sequencing notation for parsers, which is to do notation. And the do notation is very simple -- you write the word 'do', and then you have a whole bunch of parsers one after the other, and it just runs them each in sequence. So the first thing here, is we're going to parse 'some digits'. Because that's the basics of what a natural number is -- it's just some digits. And if that succeeds, I'm going to call all those digits 'xs'. So 'xs' is just going to be a list of all the digits. And then what I'm going to do here to be a bit more flexible, probably when I parse a number I don't want a string of characters back, I actually want the number. So I'm going to pass the string in to a little function called 'read', which just converts the string into a number. And then I'm going to simply return it. So the basic idea here is we're sequencing two things together -- we're reading some digits, or parsing some digits, and then we're translating those into a normal number, and then we're returning it. And we're sequencing those things together using the 'do' notation here. So just one more little example, because we'll use this in a minute. Here's how you could parse an integer. So an integer is either a negative number, or a positive number. So there's a choice there. So we're going to use the choice operator. So here's the 'or' operator, which we've seen a few minutes ago. And the two parts here just say -- have we got a negative number, or have we got a positive number? So the parser for a negative number, we use the 'do' notation, because we need to do three things. So the three things are here. So if we're trying to parse a negative number, the first thing we do is we parse a minus sign. So we're using the 'char' primitive that we saw previously -- that will parse a minus sign. Then we're going to parse a number, and call it 'n'. And then we need to remember that we need to make it negative, so we negate it and then we return it. So again here, we're just using the simple idea of sequencing three parsers, one after the other. And then the 'or' here says, or we can just have a simple natural number. Okay so this illustrates the idea of sequencing. And if you've seen some of my previous videos, you may recognize the 'do' notation here. And this is because parsers form an example of what's known as a 'monad'. And in fact, for me, parsers being monadic is one of the key ways to understand what a monad is. So if you've seen the monad video, or even if you haven't seen it, maybe you have a look back at that, and if you find it interesting. maybe look up some of the work which people have done on monadic parsing. And it's a really good way to get a very good feel for what's going on with both these kind of parsers, and monads as well. We've seen that parsers are basically functions -- they take a string as an input, and they produce essentially a tree as an output. We've seen some basic primitives for consuming single digits, and single characters, and things like that. And we've seen some basic combining forms -- we can have repetition, with 'some', we can have choice, and we can have sequencing. So we've got our basic kind of building blocks for making larger parsers. So let's wrap this up now by doing a little example. And the example I want to do is to build a really simple parser for the kind of expressions, or arithmetic expressions, that we saw back at the start. So things like two plus three times four. So what I'm doing here is I'm writing a Haskell program, which is going to implement this parser. What I've got in the first line here is simply importing the parsing library, which we've just seen -- it's just four and a half pages of code -- it's very straightforward. And what we've got here in the comments, is a simple way of writing down what the syntax, or form, or structure, of expressions are. And this is what's known as a 'grammar'. But it doesn't matter if you don't know what a grammar is, because the basic idea is very simple here. The first line says an expression can be one of two things, so this means 'or' here, in grammars. So an expression can either be a term plus an expression, or it can be a term. And then in turn, a term could either be a factor times a term, or a factor. And finally, a factor can be a bracketed expression, or an integer. So there's three simple rules here, which explain the form, or structure, of what a simple arithmetic expression can be. There's actually quite a lot of things going on here, but the only key thing I want to point out is we've got three rules, because there's three different levels of priority in an expression. So the highest level of priority is brackets. So that's one thing you learn in school, you learn that you do the brackets first, that's the highest priority thing. The middle level of priority here is multiplication, so that's thing in the middle rule. And the lowest level of priority is you do addition, and that's sitting at the top rule. And again, this priority order is something you learn at school -- you do brackets first, then you do multiplication second, and then you do things like addition last. And these three rules are just making that precise. And again, if you want to know more about grammars, you can search on that, and you'll find a lot of information about that online. So what we want to do now, is take this little grammar, and implement it as an actual parser. And this is very straightforward to do, because we're using this functional parsing idea. Essentially, we just take those three grammatical rules, and we just implement them using the combining forms, and the primitives that we've seen. So it's a very straightforward translation. So the first one, is we want to say an expression can either be a term plus an expression, or a term. So let's do the first part of that. So we've got term plus expression. So what we're going to do is parse a term, and if that succeeds call it 'x'. Then we're going to parse a '+' character, then we're going to parse an expression and call it 'y', and then we're going to return x plus y. So there's four things going on in sequence here. We're first of all parsing a term. Then we're parsing the '+' character. Then we're parsing an expression, and we're getting the values x and y -- these will be numbers. And then we're going to simply add those two numbers together. So you can see here we're actually doing more than just parsing -- we're actually evaluating the expression as well. And that's one of the advantages of this approach to parsing, that it's not just about building a tree, you can actually process things as you're going along. And we're actually processing them here by doing complete evaluation. So x and y will be numbers which, result from this term and this expression, and then we just add them together here, and return the result. Then the last part of parsing an expression is we can either be a term plus an expression, or we could be a term. So we just use the choice operator, and we get term. And these five lines here are our full parser for expressions. So we had this one rule up here -- an expression could be a term plus an expression, or a term, and we just translate it directly into our parsing notation. And again the key observation here is that the grammatical rule here looks basically the same as the parser. So let's look at the second rule, and in fact it's pretty much the same as the first rule, except a few symbols are changed. So let me just copy it. And I can just change it. So I can say a term can be a factor times a term, and then I can do a multiply there, or I get a factor. Okay, so in just a few key presses, I've managed to implement my parser for terms. And again the point to note here is that the grammatical rule here looks basically exactly the same as the parser. Okay, I've just got a few more symbols in here, because I'm actually writing a program to do parsing, or actually evaluation, but it's the same basic structure. And then just to wrap things up, I can implement what a factor is. So a factor is either a bracketed expression, or it's an integer. So let me write the parser for that. So a bracketed expression, I just parse a character, and then I parse as an expression and call it 'x', and then I parse a closed bracket, and then I return the 'x'. Or, I can parse an integer. And again, if you look at the structure of the rule up here -- a factor is a bracketed expression or an integer. I've got exactly the same thing down here -- here I'm parsing a bracketed expression, and here I'm parsing an integer. And this is actually our entire parser. We've got three lines up here, and we've just got kind of ten or fifteen lines down here, and this is actually a complete parser, and evaluator, for arithmetic expressions. And again, the beauty of this approach is the parser looks basically the same as the grammar. So let's try and see if this works, and hopefully I haven't made any mistakes. So let's load it into the system. Okay that's great, no errors. And we can see that we've loaded two files in now -- we've loaded the parsing file in, which is about four and a half pages of definitions, and we've loaded the example program in now. So now we can check, does our parser actually do what we want. So let's try out our parser with the little example that we had at the start, 2 plus 3 times 4. So we're going to parse an expression, and the expression is 2 plus 3 times 4. And we press return, and we hope we get the right result, and we do. So remember from school, you do the multiply first, you do the 3 times 4, so you get 12, and then you add the 2 on at the end, and you get 14. So we've managed to get the result 14 here, and there's no portion of the input string left. So we've got a successful result -- we've got a list, and we've got one result value, we've got 14, and we've managed to consume the whole thing. Or we could check, does this actually work with more sophisticated examples? So let's try putting some brackets in, and let's put brackets around the 2 plus 3, so we get 2 plus 3 times 4. So we hope then that we do the addition first, and we get 5, and then times by 4 to get 20. Yes, and it works. We can try more sophisticated examples. So let's do something like 2 plus 7 times 10 plus 8 times 20, and if I've got the brackets right, yes then it works fine. We can also check what happens if you give it something which doesn't parse. So suppose I do something like, I parse expression, so 2 plus 3 times, and I forget to write the 4 at the end. What's going to happen? Well, the parser will still manage to succeed, because it will manage to parse the 2 plus 3, and we'll get the 5 out, but it doesn't know what to do with this symbol sitting on its own. So you get that back as an unconsumed part of the input. And again, we can try another example. Suppose I forget to close the brackets, so I do something like 2 plus 3, and I forget to close the brackets, then it won't know what to do with that at all, and we'll just get the empty string. That's basically it -- this is the idea of functional parsing, or combinator parsing. The idea is very simple -- parsers are basically functions, you define a library with some basic building blocks, or primitives, some combining forms that let you put these things together, and then you can end up writing parsers as we've seen that look very similar to the grammars that you write to describe languages.
B1 string expression input digit tree return Functional Parsing - Computerphile 1 0 林宜悉 posted on 2020/03/27 More Share Save Report Video vocabulary