  • Hello everybody, and welcome to part six of the chatbot tutorial series with Python and TensorFlow.

  • In the last tutorial, we built our database and inserted all the rows, and at this point I'm assuming you guys have created a relatively large database of pairs.

  • In this tutorial, what I'm gonna be showing you guys is how you can create training data from that database once it's done. If you have less than, like, 100,000 pairs, I wouldn't suggest that you continue following along unless you're just kind of curious to see how things work.

  • So with that, let's go ahead and get started.

  • So what we want to do here is basically create the actual training data that we're gonna use for our models.

  • We're gonna poke around with a few models in the series, but it's pretty much always gonna be the same kind of format.

  • And the idea is that, generally, what you're gonna have is a "from" file and a "to" file.

  • So again, this is all basically a running theme.

  • I'm not sure if I've said this already, but basically what we're doing is TensorFlow sequence to sequence.

  • OK, so whether it's a chatbot, which is a comment and a reply, or it's language translation, which is what a lot of the sequence-to-sequence tutorials are doing,

  • or it could be anything.

  • Everything in life pretty much boils down to a sequence to a sequence.

  • It's not really so much a fixed input to a fixed output.

  • It's variable-length input, variable-length output.

  • And that's what's really intrigued me about sequence to sequence, and especially some of the later implementations of sequence to sequence from TensorFlow.

  • That's what's pretty exciting about it. Anyway, back to planet Earth.

  • What we want to do is create basically a parent comment file and then a reply file, where each line number in one file corresponds to the same line number in the other file.

  • Okay, so line 15 in the from file is the parent comment, and line 15 in the to file is the child, the reply to that parent comment.
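
A minimal sketch of that line-by-line pairing, using in-memory lists in place of the real from and to files. The sample comments are made up for illustration and are not data from the tutorial's database:

```python
# Hypothetical parent comments and replies; in the tutorial these come
# from the database, one pair per row.
pairs = [
    ("How is the weather today?", "Sunny and warm."),
    ("Any plans for the weekend?", "Probably just sleeping in."),
    ("Did you see the game?", "No, I missed it."),
]

# The "from" side holds parents, the "to" side holds replies.
from_lines = [parent for parent, _ in pairs]
to_lines = [reply for _, reply in pairs]

# Line i on one side lines up with line i on the other.
for line_number, (parent, reply) in enumerate(zip(from_lines, to_lines), start=1):
    print(line_number, parent, "->", reply)
```

The only property that matters here is that both sides have the same number of lines and keep the same order.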

  • Okay, so in order to do this we're gonna import sqlite3.

  • We're gonna import pandas as pd.

  • If you don't have pandas installed, pip install pandas.

  • Then we're gonna have timeframes. I basically built this with it in mind that I might have many different databases with different timeframes, like 2015-05.

  • But really, I think you would most likely combine them, if that's what you're gonna do.

  • But anyways, let's leave it that way.

  • And then it's for timeframe in timeframes.

  • What do we want to do?

  • What we're gonna do is build this connection and then use read_sql from pandas to read it.

  • Now, you actually don't need to use pandas to do this.

  • I'm just gonna use pandas because there might be times when I want to add a little bit more functionality, a little bit more logic, to the SQL pull here, or even just data manipulation or whatever.

  • For that reason I'm using pandas here, but for what we're gonna do in this tutorial series, at least for what I know I have planned out,

  • I guess you wouldn't need to use pandas.

  • But I'm gonna use pandas.

  • So anyway, connection will be sqlite3.connect, and we will connect to that database, so '{}.db', and then .format(timeframe).

  • Then c, the cursor, is equal to connection.cursor().

  • And then what we're gonna say here is, first, limit equals 5000. We'll say last_unix equals 0,

  • cur_length equals limit, counter equals 0, and test_done equals False.

  • You should know what all that means.

  • So limit will be how much we're gonna pull at a time to throw into our pandas data frame.

  • last_unix will help us to basically buffer through our database.

  • So we'll make a pull, and we'll grab the last Unix timestamp of that pull.

  • And then from there, we know, okay, in our next pull, unix must be greater than last_unix.

  • And so we just keep doing that, where each pull has a limit of whatever this number is.

  • In this case, it's 5000.

  • Eventually we could raise that, but for now I want 5000 because the test is not done yet.

  • So generally, yeah, you're gonna have a from and then a to file.

  • But you also wanna have testing files, something out of sample, just to see how the model's doing.

  • So we're going to use a test file, and that test file will be the first 5000 rows of data.

  • You could make this anything. You could do 500, you could do 50,000, you could do 100.

  • You can do whatever you want.

  • I'm gonna say 5000 for now.

  • But, yeah, you can do something else if you want.

  • So then what we're gonna go ahead and do is ask the question:

  • while cur_length is equal to limit. That means we were able to make a pull that completely exhausted whatever the limit was.

  • So chances are, either there are zero rows left, which we'll find out in a moment, or there are still rows left.

  • So as long as we're able to get our limit's worth from the database, we probably have more pulls to make, so we'll keep making pulls.
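
The paging idea above can be sketched with only the standard library, since pandas is not strictly needed for it. This is an illustration under assumptions, not the script from the video: the in-memory table and its rows are made up, the limit is shrunk to 5, and the query uses `?` parameter binding rather than string formatting.

```python
import sqlite3

# In-memory stand-in for the tutorial's database. Table and column names
# follow the transcript; the rows themselves are made up.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE parent_reply (parent TEXT, comment TEXT, unix INT, score INT)"
)
connection.executemany(
    "INSERT INTO parent_reply VALUES (?, ?, ?, ?)",
    [("parent {}".format(i), "reply {}".format(i), 1000 + i, 2) for i in range(12)],
)

limit = 5           # rows per pull (5000 in the tutorial)
last_unix = 0       # highest timestamp seen so far
cur_length = limit  # start equal to limit so the loop runs at least once
page_sizes = []

while cur_length == limit:
    rows = connection.execute(
        "SELECT parent, comment, unix FROM parent_reply "
        "WHERE unix > ? AND parent NOT NULL AND score > 0 "
        "ORDER BY unix ASC LIMIT ?",
        (last_unix, limit),
    ).fetchall()
    cur_length = len(rows)
    page_sizes.append(cur_length)
    if rows:
        # Remember where this pull ended so the next one starts after it.
        last_unix = rows[-1][2]

print(page_sizes)
```

A short final pull ends the loop. The `if rows:` guard covers the case where the row count is an exact multiple of the limit and the last pull comes back empty. Note that paging on `unix > last_unix` can skip rows that share a timestamp with the last row of a pull.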

  • So then what we're gonna say is df, for data frame, equals pd.read_sql.

  • And what we're going to pass is the SQL statement.

  • So we're gonna SELECT, and for now we'll just do all, FROM parent_reply WHERE unix is greater than something,

  • and, this should be all caps, AND parent NOT NULL,

  • AND score is greater than 0, and it sure as heck better be ORDER BY unix ASC, LIMIT something.

  • Okay, .format.

  • And basically, what we need is for unix to be greater than last_unix,

  • so it starts at zero,

  • and then limit.

  • Yeah, I guess those are the only two: we just did unix and then the limit. Yes, that's all the things that we formatted.

  • Awesome. So that's it. Format that.

  • And then finally, the other thing: when you do pd.read_sql, you pass first the SQL statement and then you pass the connection.

  • So, connection. Boom.

  • Now, coming down here, we're gonna say last_unix equals df.tail(1), so the last row, then ['unix'].values[0]. Boom.

  • So now we've updated last_unix. Next, cur_length:

  • let's see, what's the length of the data frame?

  • It should be whatever the limit is.
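
Here is a small sketch of a single pull with `pd.read_sql`, followed by the `last_unix` and `cur_length` updates described above. It assumes pandas is installed; the in-memory table and its eight rows are invented for the example, and the limit is shrunk to 5.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for the tutorial's database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE parent_reply (parent TEXT, comment TEXT, unix INT, score INT)"
)
connection.executemany(
    "INSERT INTO parent_reply VALUES (?, ?, ?, ?)",
    [("parent {}".format(i), "reply {}".format(i), 1000 + i, 2) for i in range(8)],
)

limit = 5
last_unix = 0

# First argument is the SQL statement, second is the connection.
df = pd.read_sql(
    "SELECT * FROM parent_reply WHERE unix > {} AND parent NOT NULL "
    "AND score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix, limit),
    connection,
)

# Timestamp of the last row in this pull, and how many rows came back.
last_unix = df.tail(1)["unix"].values[0]
cur_length = len(df)
print(last_unix, cur_length)
```

If a pull ever returns zero rows, `df.tail(1)["unix"].values[0]` raises an `IndexError`, so a real script may want to check `len(df)` first.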

  • Now we're gonna ask, if not test_done,

  • we're gonna say with open, and we'll call this test.from, with the intention to append,

  • and we're gonna specify the encoding as utf8,

  • as f. What we want to do is for content in df['parent'].values:

  • What do we want to do? We want f.write, content plus a newline.

  • Something felt wrong. Content plus newline. Okay.

  • And then we basically want to do the exact same thing with test.to.

  • So with open test.to, and then this should be comment.

  • So those will match.

  • Oh, and then also, when we're done, we'd better say test_done equals True.
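
A rough sketch of that test-file step. The two lists stand in for `df['parent'].values` and `df['comment'].values`, and the files go into a temporary directory (an assumption made so the example does not touch real files):

```python
import os
import tempfile

# Made-up rows standing in for df['parent'].values and df['comment'].values.
parents = ["first parent", "second parent"]
comments = ["first reply", "second reply"]

out_dir = tempfile.mkdtemp()  # scratch directory so nothing real is overwritten
from_path = os.path.join(out_dir, "test.from")
to_path = os.path.join(out_dir, "test.to")

test_done = False
if not test_done:
    # Mode "a" appends, so each parent lands on its own new line.
    with open(from_path, "a", encoding="utf8") as f:
        for content in parents:
            f.write(content + "\n")
    # Same order for the replies, so line numbers match across files.
    with open(to_path, "a", encoding="utf8") as f:
        for content in comments:
            f.write(str(content) + "\n")
    test_done = True

with open(from_path, encoding="utf8") as f:
    from_lines = f.read().splitlines()
with open(to_path, encoding="utf8") as f:
    to_lines = f.read().splitlines()
```

Because the files are opened in append mode, running a script like this twice doubles their contents, so old output files should be removed before a fresh run.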

  • Also, after this point, probably right here, what you could do, if you wanted, is update the limit.

  • So we've already done the limit check.

  • So you could, in theory, update limit at this point.

  • Hmm. No, that would get angry.

  • You'd have to update cur_length and limit together if you wanted to do that. I'm not going to do that.

  • But anyway, if you want to, now would be the time.

  • Next, what we're gonna say is else, so assuming the test is done, we basically do the same thing.

  • So I'm going to copy this, paste, and then we're gonna call this train, and train again.

  • This is probably best as, like, some sort of function where the only parameter is the name of the file,

  • so, like, test or train.

  • I'm gonna pass on that right now, but yeah, we could improve the script by doing that.
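
One possible shape for the refactor mentioned above, as a sketch only. The helper name `append_pairs` and its extra parameters are hypothetical; the video only suggests a function that takes the file name, so the data arguments and the `directory` argument are additions made to keep the example self-contained.

```python
import os
import tempfile


def append_pairs(prefix, parents, comments, directory="."):
    """Append parents to <prefix>.from and comments to <prefix>.to."""
    for suffix, values in (("from", parents), ("to", comments)):
        path = os.path.join(directory, "{}.{}".format(prefix, suffix))
        with open(path, "a", encoding="utf8") as f:
            for content in values:
                f.write(str(content) + "\n")


out_dir = tempfile.mkdtemp()  # scratch directory for the example
test_done = False

# Three made-up pulls: the first goes to test.*, the rest to train.*.
pulls = [
    (["p1", "p2"], ["r1", "r2"]),
    (["p3"], ["r3"]),
    (["p4"], ["r4"]),
]
for parents, comments in pulls:
    append_pairs("train" if test_done else "test", parents, comments, out_dir)
    test_done = True

with open(os.path.join(out_dir, "test.to"), encoding="utf8") as f:
    test_to = f.read().splitlines()
with open(os.path.join(out_dir, "train.from"), encoding="utf8") as f:
    train_from = f.read().splitlines()
```

This removes the copy-and-paste between the test and train branches while keeping the same behavior: the first pull becomes the test set and every later pull is appended to the training files.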

  • Now, what we're gonna do, still within that while cur_length equals limit,

  • is go ahead and say counter plus-equals 1.

  • And then, if counter modulo 20 equals 0,

  • let's print counter times limit, "rows completed so far".

  • So in this case, counter modulo 20, so basically it's gonna be, like, every 20 times the counter...

  • I'm sorry, every 20 times the limit, you'll see this printed out, so it would be 5000 times 20.

  • So 100,000: every 100,000 rows completed,

  • we're going to get this information.
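
The progress arithmetic can be checked with a tiny sketch. The 45 simulated pulls are an arbitrary number chosen for the example, and messages are collected in a list so they can be inspected:

```python
limit = 5000
counter = 0
messages = []

# Pretend we made 45 full pulls from the database.
for _ in range(45):
    counter += 1
    # Every 20th pull, report how many rows have been processed.
    if counter % 20 == 0:
        messages.append("{} rows completed so far".format(counter * limit))

print(messages)
```

With a limit of 5000, a message appears after pulls 20 and 40, that is, at 100,000 and 200,000 rows.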

  • So let's go ahead and save that. I'm gonna run it just to see if it works.

  • But I don't have a full pull, so I'm just gonna stop this whenever it's done. For the division, I should have chosen a smaller number.

  • Okay, never mind.

  • Okay, so we completed.

  • Let me pause while you check those files and make sure they are correct.

  • Okay, so here we have our files.

  • These are for testing, so test.from and test.to.

  • Let's go open those, test.from and test.to.

  • So: "Aren't they streaming it for free online?"

  • "Yes. Yes, they are. That poor bastard."

  • So, I don't know, I guess he bought something. We'll continue down here. Basically, you should have only 5000 rows.

  • So, I mean, the number of rows needs to be exactly the same.

  • Same thing, though, with train.from and train.to: we should be able to open those up, and again, line 28 corresponds to line 28.

  • Interesting.

  • Thank goodness for utf8, right? Funny.

  • Dodds give the butt of approval.

  • I just keep finding really golden lines here.

  • Here's our newline character.

  • Okay, great.

  • So that's what we need to do to get our data.

  • If you did a full pull, obviously, you're gonna have much, much, much more data.

  • In our case, we just have this data here, but hopefully you have a much, much larger data set than just this

  • short little bit of data.

  • All right, so that's the end of this tutorial.

  • In the next tutorial, we're gonna actually start talking about the models that we're gonna use.

  • There are at least two models that we're gonna be talking about,

  • so that's what you guys have to look forward to.

  • If you have any questions, comments, concerns, whatever,

  • feel free to leave them below.

  • Otherwise, I will see you in the next tutorial.
