  • What is going on?

  • Everybody, welcome back to another data analysis slash data science tutorial with Python and pandas. We're gonna be continuing our work with the avocado dataset before we leave it for a new and exciting dataset in the next tutorial.

  • Just a few more things I want to show with the avocado data set.

  • Uh, especially just issues you might run into over time.

  • So let's go ahead and jump in.

  • So the first thing we're gonna do is basically recreate where we were: import pandas as pd, df = pd.read_csv("datasets/avocado.csv"), albany_df = df[df["region"] == "Albany"].

  • Then albany_df.set_index("Date", inplace=True); inplace will be True in there.

  • Finally, albany_df... actually, we're just gonna do albany_df["AveragePrice"].plot(). And I didn't look up the command, so let's just run this twice, and there we have it.

  • So pretty graph.

  • But there's a few issues that we're having right out of the gate.

  • The first issue is these, like, these dates are running over each other.

  • And when you see something like this, it probably means pandas doesn't actually realize it's a date.

  • So the first thing that I would do is convert that to a datetime.

  • So when we read in the CSV, if you actually have a date, probably the big thing you'd want to do is go ahead and just say df["Date"], or whatever the column name is, equals pd.to_datetime(df["Date"]).

  • So, um, I'm trying to think if we've done this in the first one, but I don't think so.

  • This is our first time where we're actually reassigning the value of that column.

  • This is no different than any other function, really.

  • Like, you can just map functions to the column here like that.

  • There are other things you could use, like map and apply and stuff like that, and we'll talk a little more about that.

  • But this is a nice built-in pandas function that's gonna convert this column.

  • Um, and then we're just reassigning that value, because this is actually just gonna return a bunch of values for us.

  • So pretty cool.

  • Anyway, uh, now if we go to graph that, we can see:

  • Okay, they're nice and slanted, and it appears that pandas actually knows,

  • hey, these are dates. So, awesome.
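
The datetime conversion described above can be sketched as follows. This is a minimal example on a tiny made-up frame standing in for the avocado CSV; the values are invented, and only the Date and AveragePrice column names come from the walkthrough.

```python
import pandas as pd

# Tiny stand-in for the avocado CSV (made-up values, not the real data).
df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
})

# Straight after loading, the Date column holds plain strings.
print(pd.api.types.is_datetime64_any_dtype(df["Date"]))

# to_datetime returns a new Series, so assign it back to the column.
df["Date"] = pd.to_datetime(df["Date"])

# Now pandas treats the column as real dates.
print(pd.api.types.is_datetime64_any_dtype(df["Date"]))
```

Once the column is a real datetime, plotting code can space and rotate the axis labels sensibly instead of printing every string on top of the next.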

  • So that's one thing.

  • The next thing is, this graph is crazy, crazy, busy looking.

  • So the first thing I would think of to smooth it out is to use some sort of moving average.

  • So, um, if you don't know what a moving average is, it works like this at any point along the way.

  • Let's say we're gonna do a 25-point moving average, so we'll take every point and we're gonna say, okay, at this point:

  • So this point and the previous 24 points, what's the average?

  • So 25 point average.

  • And then we go one point over and do the exact same thing.

  • We just keep slowly rolling that,

  • and what that's gonna wind up doing is condensing these crazy fluctuations, and it will hopefully smooth things out a little bit.

  • So the way we do that is albany_df, and then we'll do ["AveragePrice"], and then .rolling.

  • So this is like any rolling thing.

  • Then you would pass a number for the window, and then what kind of rolling.

  • So you could say a rolling sum, or in this case we're gonna say a rolling mean, and then we're gonna plot it. Cool, except not cool, because this is pretty ugly.

  • Something went wrong.

  • So the first thing I would think of when I see a chart like this is that possibly things are out of order.

  • So what if we said albany_df, let's do .head(25)?

  • So when we look here, we can see, okay, we start...

  • so it's in reverse order, and then we scroll down.

  • And at least in this case, it does look like everything is in proper order.

  • Uh, let's do .index, but you can see here, at least right out of the gate,

  • um, the index does start at late December and then goes down in reverse.

  • So you would think, oh, this is reverse chronological.

  • But apparently it's not, because then we have 2018, which again goes in reverse chronological order.

  • So something is wrong.

  • So the first thing that I would do is, okay, let's make sure these dates are in proper order.

  • So what I would say is, um, let's do albany_df.sort_index(inplace=True).

  • You can also sort data frames by specific columns and that will also adjust the index.

  • Just for the record, we do get a little bit of a warning here.

  • Uh, I will talk about that warning in a little bit.

  • So all I want to do now is actually just sort that thing.

  • Graph that again.

  • Okay.

  • And now we get a pretty graph, and I guess the warning doesn't show.

  • I don't know why; we've already kind of sorted the index and stuff.

  • Anyway, we'll see the warning soon enough.

  • So now we get a nice, smoothed-out graph that actually is a little different than the other graph.

  • There are kind of like these two humps, but the second hump is quite a bit higher than the first,

  • whereas in that first graph...

  • It was clearly like, this one is clearly wrong.

  • And then even this one looks to be pretty darn wrong.

  • So, uh, pretty cool.

  • Uh, Now we got that fixed.
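
To see why the sort matters before rolling, here is a small sketch with invented prices and a deliberately scrambled date index. A window of 2 is used instead of 25 only because the toy data is so short.

```python
import pandas as pd

# Made-up prices with a deliberately scrambled date index.
dates = pd.to_datetime(["2018-01-14", "2015-01-04", "2018-01-07", "2015-01-11"])
prices = pd.Series([4.0, 1.0, 3.0, 2.0], index=dates, name="AveragePrice")

# Rolling over the scrambled order averages rows that are neighbors in the
# file, not neighbors in time.
unsorted_ma = prices.rolling(2).mean()

# Sorting the index first puts the rows in chronological order.
sorted_ma = prices.sort_index().rolling(2).mean()

print(unsorted_ma)
print(sorted_ma)
```

rolling() works on row position, not on the dates themselves, so it averages whatever rows happen to sit next to each other.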

  • So the next thing that I want to do is, what if we actually want this to be a column in our data frame?

  • Because we can later output this data frame to a CSV again.

  • And it doesn't have to look exactly the same as it came in.

  • It could be anything.

  • So we decided, Hey, this is a valuable meaningful column.

  • So, actually, what I want to do is copy this column, come down here, and I say albany_df,

  • and I'm just gonna assign a new name.

  • I'm gonna say ["price25ma"] is equal to that value.

  • Okay, so we'll run that.

  • See that warning again?

  • I will explain that momentarily. albany_df.head(),

  • and then we'll just do like three here, and we can see initially

  • we do see NaNs.

  • That's because the rolling period requires 25.

  • So the first 24 are gonna be NaN, because there aren't 25 points to calculate with yet. Later you can set,

  • I think it's literally min_periods or something like that, and you can overcome that.

  • Otherwise, what we could say is albany_df.tail(3), and boom.

  • We've actually got values. Alternatively, um, we'll almost certainly use this again in the future,

  • but you can say albany_df.dropna(), for example, and

  • then we'll just say .head(3).

  • And in this case, what dropna is gonna do for us is remove any rows that have any NaN values, any NAs.

  • Anyways, it removes the NAs. So later, if you have, like, gaps for whatever reason, or you just want to drop missing data... NaN is "not a number."

  • So anything that doesn't have values, you can just drop them.

  • And so you could do that.

  • And you can also dropna with inplace=True

  • if you want.

  • I'm not gonna drop them here, but just know that you can.
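
A small sketch of the NaN behavior just described, using a window of 3 on invented numbers so the effect is easy to see. The column names price3ma and price3ma_min1 are made up for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"AveragePrice": [1.0, 2.0, 3.0, 4.0, 5.0]})

# window=3: the first 2 rows have fewer than 3 points, so they come out NaN.
frame["price3ma"] = frame["AveragePrice"].rolling(3).mean()

# min_periods=1 lets the average start with however many points exist so far.
frame["price3ma_min1"] = frame["AveragePrice"].rolling(3, min_periods=1).mean()

# dropna returns a new frame without rows that contain any NaN;
# the original frame is untouched unless inplace=True is passed.
cleaned = frame.dropna()

print(frame)
print(cleaned)
```

In general a window of N leaves the first N - 1 rows empty, so a 25-point window leaves 24 NaN rows at the start.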

  • So now that we've done that, let's talk about this error.

  • So first of all, it's not an error.

  • I can't believe I just called it an error.

  • But part of my point is, it's not an error.

  • Um, so it's a little bit of a warning. pandas is really happy to serve warnings; you're going to see tons of warnings from pandas.

  • Same thing with scikit-learn.

  • We're gonna use scikit-learn a little bit later, in like part five probably, and these two libraries,

  • for some reason, just love to issue warnings about, like, deprecations, possibly future deprecations way far in the future.

  • Uh, so you'll see the warnings for, like, years.

  • And anyway, with pandas, what this warning is telling you is a value is trying to be set on a copy of a slice from a DataFrame, and then it says, try using .loc.

  • Um, I don't think that's actually useful.

  • Usually warnings are pretty useful.

  • They will tell you about future deprecations and new functions that you should use or whatever, so read them.

  • Um, but I don't really think this is useful.

  • Personally, I don't think that's the right way to do it, But anyway, the warning still stands.

  • Hey, it's just letting you know you're doing something that might screw you up later.

  • So basically what it's telling you is, as you begin to do work, like, we've actually kind of forked the original data frame into this Albany data frame, and then we're changing values.

  • And what pandas is trying to tell us is, if you were just to go perfectly in order and only modify a data frame and always keep it df, you would never have a problem.

  • But if at some point you fork and then you start changing values over here, and here, and here, well, don't forget that it came from that original data frame.

  • And you might be either one changing values that you didn't expect or not changing values that you did expect.

  • So that's what that warning is there to tell you.

  • So how could we not see this warning?

  • Because later you might be doing things and you might be iterating, and then you'll see like hundreds of these warnings.

  • So one way that you could do is by telling pandas I get it.

  • It's a copy.

  • So how could we do that?

  • Well, I'm so glad you asked. Just albany_df = df.copy().

  • And now what that does is return to us a new copy of the data frame, because later we could modify that data frame,

  • um, and it won't impact albany_df, but pandas wants us to know it won't impact albany_df.

  • So instead, we just do df.copy(), which returns the data frame.

  • So now we can reference that; you know, we can treat it just like a data frame right away.

  • So we could say df.copy(),

  • where df["region"] equals "Albany", so df.copy()[df["region"] == "Albany"].

  • Then we do all the same things that we did before.

  • So, for example, we can take this, and, uh, I can't remember when we actually okay, so we set the index afterwards, So let's go ahead and set the index.

  • And we also sorted the index.

  • I believe afterwards.

  • So set index, sort index.

  • Let's just run that real quick.

  • Okay?

  • Cool.

  • Uh, and now we don't have any warnings.

  • So basically, I just wanted to show you guys how you get away with, you know, not seeing the warnings.
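
Here is a minimal sketch of the copy idea on invented data. One assumption to flag: it filters first and then calls .copy() on the result, a common variation on the df.copy()[...] ordering used in the walkthrough. The price2ma column name is made up for the toy window of 2.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-11", "2015-01-04", "2015-01-04"]),
    "AveragePrice": [1.2, 1.1, 0.9],
    "region": ["Albany", "Albany", "Boston"],
})

# .copy() makes the filtered rows an independent frame, so adding a column
# to it is unambiguous and should not trigger the copy-of-a-slice warning.
albany_df = df[df["region"] == "Albany"].copy()
albany_df.set_index("Date", inplace=True)
albany_df.sort_index(inplace=True)
albany_df["price2ma"] = albany_df["AveragePrice"].rolling(2).mean()

print(albany_df)
print(df.columns.tolist())   # the original frame has no new column
```

Either ordering gives an independent frame; copying after the filter just avoids duplicating rows that are about to be thrown away.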

  • Okay, so we've got this graph Now, the next thing you might think about is like, Hey, I want to graph all the regions on one graph and see them compared.

  • Okay, so how would we do that?

  • So the first thing we need to do is somehow iterate over all the regions, but

  • we have to figure out all of the regions.

  • So the cool thing about pandas is, at the end of the day, it is rows and columns, and each individual column is just an array.

  • So pandas at the end of the day can be an array, you know, a multidimensional array, an array of arrays, like a list of lists.

  • Anyways, um, it could be a multidimensional array.

  • So how might that happen, by the way? Uh, for example, with .values. Boom.

  • And now you can see this has actually been converted to an array.

  • We could do the exact same thing with a column, so we could say df["region"].values, and we see we get an array.

  • Now it, uh, just doesn't give us the length.

  • But what we can say is, well, it's gonna be the length of the data frame, which I think was like 18,000 or something.

  • It's not gonna be interesting to know that.

  • So we can also convert it to a list, .tolist(), and then you can see, okay, we've got quite a few values here.

  • Too many values.

  • So how do we get, like, just the unique items?

  • Well, one option would be quite literally just convert it to a set, and now we can see,

  • okay, these are all the regions, and then we could convert it to a list too, and then we could iterate over that list.

  • So Okay, the problem is, that's messy, and there's a simpler way.

  • So a lot of times in pandas, there's just an easier way than whatever we're doing.

  • So in this case there is: it will be df["region"].unique(). Done.

  • So this way we can get all the unique values.

  • And then we could actually iterate over these.

  • And for that reason too, uh, don't forget about the pandas docs. Scroll through the docs.

  • Just get an idea for all the things that are available to you before you start trying to do it in Python and moving it back, because again, if you can keep things in pandas, it's probably gonna be faster than what you write.

  • What you write may end up with the exact same outcome, but pandas is likely using compiled C code under the hood.

  • You're gonna be using Python, which is slow.

  • So, yeah.

  • Okay.

  • df["region"].unique().

  • Cool.

  • So we have the unique values.
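
The different routes to the list of regions can be compared side by side. This is a sketch on a made-up five-row column; only the region column name comes from the walkthrough.

```python
import pandas as pd

df = pd.DataFrame({"region": ["Albany", "Boston", "Albany", "Denver", "Boston"]})

as_array = df["region"].values        # the column's underlying array
as_list = df["region"].tolist()       # plain Python list, duplicates included
via_set = list(set(as_list))          # unique, but order is not guaranteed
via_unique = df["region"].unique()    # unique, in order of first appearance

print(len(as_array), len(as_list))
print(sorted(via_set))
print(list(via_unique))
```

The set route and unique() find the same values; unique() skips the round trip through Python objects and keeps the order in which regions first appear.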

  • Now, how might we graph every single one?

  • So with pandas on Matplotlib,

  • basically, you've got this canvas in the background, and then any time you plot something, it just kind of adds.

  • But we're in this, like, Jupyter notebook.

  • So it's gonna, like, not...

  • Well, it's JupyterLab now, but anyways, uh, probably what we want to do is actually create a new data frame, and then just plot that data frame.

  • So, really, our task here is: these regions are values in rows, and instead what we want to do is almost, like, reshape our data frame to be a data frame where the columns are the regions and the rows...

  • Well, the column headers are the regions, and the values of those columns are, let's say, the 25 moving average, and then the index is date.

  • So we just kind of want to restructure it, basically.

  • So the way we're gonna do that is by iterating over the regions and basically using all the same code that we've used up to this point.

  • Um, I'm gonna make some space.

  • I always hate being, like, at the bottom. So, okay, graph_df. There's also other ways that we can do this, but this is generally the way I kind of like to start such an operation.

  • So graph_df = pd.DataFrame().

  • Now, for region in df["region"].unique().

  • Now we just want to iterate over all of the regions, so let's just print region so we know where we are.

  • And then we want to do basically all the things that we've kind of done so far.

  • So we're gonna say region_df = df.copy(),

  • and we're gonna say df.copy() where df["region"] is equal to whatever the region is that we're working with.

  • And in fact, I'm just gonna copy this. Copy,

  • come down here,

  • paste, tab, tab: region_df, region_df, region_df, and region_df.

  • Okay, I think it's good.

  • So, uh, okay.

  • So now what we want to do is, let's just say if graph_df.empty. So basically, if this data frame doesn't have anything in it, then all we're gonna say is graph_df = region_df, and all we want is that 25 MA column.

  • So, um, in fact, okay, we've got to go fix one thing, but for now, I'm going to say it's "price25ma", right?

  • That's the one that we want, but a couple of things here.

  • This will return a Series.

  • We don't want a Series.

  • We want this to be a DataFrame, so we can continue adding new columns.

  • So the way that we do this is to slice this... hey, jerk, not cool.

  • This, that's a DataFrame.

  • So the only difference is we've got two brackets, and then later, if you wanted to specify multiple columns, you'd use the price MA and then also go with the average price, like for the raw data.

  • But I don't actually want that.

  • I just want the price MA. And then we're gonna say, so if it's empty, that's that. Else,

  • graph_df = graph_df.join.

  • So this is how we can bring data frames together on an index, and then we would want to join region_df

  • and then that price 25 MA column, "price25ma".

  • Now, the problem is... and if you want to learn more about join, check out the docs, but basically, join is just gonna be used where

  • you have two data frames and they're indexed in the exact same way.

  • Then you can just call...

  • you can say first_df.join(second_df), and then they will be joined on their index.

  • So, um, the one problem we're gonna have is we've got many regions like, I don't know, 50 or something.

  • Uh uh, and they all have the same column name.

  • That's gonna be a problem.

  • So instead of calling them all the same, I'm gonna use f-strings, and then we're actually just gonna do underscore, and then we're going to say whatever the region name is.

  • So this is how we're gonna give it a unique column name every time.

  • So copy that.

  • I'm gonna paste it into here and then finally into there.

  • So with that, we can join them all.

  • So, uh, I'm gonna go ahead and run this. I already know it's not gonna work.

  • Oh, actually, before I run it, I'm gonna just say up to, uh, 16. Okay, so run that, and then we can actually see here,

  • uh, that it's slowing down.

  • We got to Grand Rapids, got Great Lakes, and then I don't know if it's done yet.

  • Let me see.

  • Yeah.

  • OK, now it's done.

  • But the first bunch came really quickly, right?

  • And then it slowly starts grinding.

  • So, um so, yeah, that's relatively interesting.

  • So, um, I was trying to figure out what the heck is happening.

  • This is, uh, logic that I've used for years, and I've never had this issue.

  • And if I watch the processes, like if I actually open up Task Manager or something, this is exploding RAM.

  • Now, what I don't know is why it explodes RAM.

  • What I did find out is the cause. So if we look at our data frame,

  • let's just do df.tail(3).

  • If we look at our data frame, um, everything looks pretty hunky-dory and all set.

  • The problem comes right here.

  • So here we have separate PLUs, right, so each unique PLU has its own column.

  • What doesn't have its own column is type, so you have two different types.

  • So if we said df["type"], uh, let's just do .unique() again,

  • there's actually two types, conventional and organic.

  • This is what's causing our problem, because we've got duplicate dates.

  • So it's expected that our index is the same.

  • But the problem is, well, we wind up having multiple dates all the same.

  • So, for example, if I say graph_df.tail() and run that, well, these are all the same date.

  • And so when we go to actually join, pandas is looking for, okay, where should we join this?

  • Well, it has many indexes that are identical, so it's like, I don't know what to do.

  • So again, I don't know the underlying reason why RAM is exploding so much, but I know the cause, and it's because of the date.
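
One plausible explanation for the memory blow-up, offered as an illustration rather than a diagnosis of the actual run: when both frames carry duplicate index values, join pairs every matching row on the left with every matching row on the right. The sketch below uses invented numbers with two rows per date, the way conventional and organic would both appear.

```python
import pandas as pd

# Two rows share each date, as when both avocado types are kept.
dates = pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"])
left = pd.DataFrame({"Albany_price25ma": [1.0, 1.5, 1.1, 1.6]}, index=dates)
right = pd.DataFrame({"Boston_price25ma": [2.0, 2.5, 2.1, 2.6]}, index=dates)
third = pd.DataFrame({"Denver_price25ma": [3.0, 3.5, 3.1, 3.6]}, index=dates)

joined_once = left.join(right)
joined_twice = joined_once.join(third)

# Row counts before and after each join.
print(len(left), len(joined_once), len(joined_twice))
```

With two rows per date, every additional region doubles the row count, so a few dozen regions would grow far beyond what fits in memory. Keeping a single type makes the dates unique and the join one-to-one.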

  • So instead, what we need to do is when we bring in... I don't know, we've gotta go all the way up to the top, huh?

  • Um, so, df.copy().

  • I kind of wanna... I guess I just want to get it at the source, to be honest with you guys.

  • So probably what I'll do is... yes, we convert to datetime.

  • So I'm gonna take this.

  • And in fact, let's just take all this and then come down here and just kind of redo what we should've done.

  • So import pandas as pd, df = pd.read_csv.

  • That's fine.

  • And then let's just immediately pick either conventional or organic.

  • It really doesn't matter.

  • But I must say df = df.copy(),

  • otherwise it's gonna be like... that's exactly what it sounds like, by the way. df["type"] equals, and then we'll just go with "organic".

  • Okay. Then we set the date, and then, uh, let's do a df...

  • So, for example, while I have you guys, I could just say df.sort_values, and we're gonna say df.sort_values(by="Date"), and then we're gonna say ascending=True, I believe.

  • That's actually the default, I think.

  • And then we'll say inplace=True.

  • So this is just how you would sort a data frame by a certain value.

  • So before, I showed you sort_index,

  • but I said, hey, you could sort by columns too.

  • Here's how you do it.

  • And then let's do df.head().

  • Cool.

  • And now, um...

  • well, let's just iterate over this.

  • So now what I want to do is come back up here, because we're actually setting index here anyway. So let me take this from up here, copy,

  • come down here; df.sort_values should be good there.

  • Uh, let's just run that real quick.

  • Make sure. Okay?

  • I think so.

  • I mean, that was much quicker.

  • So now I'm gonna get rid of this, uh, thing that limits us.

  • Yes.

  • And then at the end here, let me just say graph_df.tail().

  • So run those.

  • Uh, yeah, it's definitely going through that, like, really fast.
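
Putting the pieces together, the loop might look roughly like this. It is a sketch under stated assumptions: the data is synthetic (two regions, two types, six weekly dates, invented prices), the window is 3 rather than 25 because the toy data is short, and the filter-then-copy ordering is a variation on the walkthrough's df.copy()[...].

```python
import pandas as pd

# Synthetic stand-in for the avocado data (made-up numbers).
dates = pd.date_range("2015-01-04", periods=6, freq="7D")
rows = []
for region, base in [("Albany", 1.0), ("Boston", 2.0)]:
    for kind, bump in [("conventional", 0.0), ("organic", 0.5)]:
        for i, date in enumerate(dates):
            rows.append({"Date": date, "AveragePrice": base + bump + 0.1 * i,
                         "region": region, "type": kind})
df = pd.DataFrame(rows)

# Keep one type so each region has at most one row per date.
df = df[df["type"] == "organic"].copy()
df.sort_values(by="Date", ascending=True, inplace=True)

graph_df = pd.DataFrame()
for region in df["region"].unique():
    region_df = df[df["region"] == region].copy()
    region_df.set_index("Date", inplace=True)
    region_df.sort_index(inplace=True)
    col = f"{region}_price3ma"           # unique column name per region
    region_df[col] = region_df["AveragePrice"].rolling(3).mean()

    if graph_df.empty:
        graph_df = region_df[[col]]      # double brackets keep a DataFrame
    else:
        graph_df = graph_df.join(region_df[col])

print(graph_df.tail(3))
```

Because every region now has exactly one row per date, each join is one-to-one and the result keeps one row per date however many regions are added.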

  • Okay, So now finally, we can plot this.

  • So the first thing that we might try to do is graph_df.plot(). Two issues.

  • One is the legend is just too crazy.

  • The other thing is, the graph isn't quite big enough yet.

  • So again, pandas can always be converted out to be an array or list or whatever.

  • So if you know Matplotlib, you can make crazy graphs just how you know with Matplotlib.

  • But with pandas, like, for example, .plot, that's some magic.

  • We never even imported Matplotlib, so you can either just use what they give you,

  • you don't have to put up with this, or you could make custom Matplotlib graphs.

  • If you want to learn more about Matplotlib,

  • I've got, like, a total super long series on it.

  • Maybe... I thought I clicked it. Anyway.

  • There we go, uh, and you can go through this tutorial.

  • There's quite a few parts, and you can learn all about custom charts, but instead we can just say a few simple things here, so we can say figsize and we'll make this (8, 5). That's in inches, apparently. And then legend,

  • we're gonna set that equal to... uh, no idea why I messed that up.

  • Anyway.

  • Legend equals false.

  • Beautiful.

  • So this is all of them compared to each other?

  • Yes, it would...

  • might be useful to have the legend, actually.

  • Ah, but it is what it is.

  • Why do we have this empty gap?

  • We talked about it already.

  • It's because of the NaN values.

  • So we can actually say dropna and then plot it.

  • And then now you don't have that gap there.
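
The final plotting call can be sketched like this, on a tiny invented frame. The Agg backend line is an assumption added so the example runs without a display; in a notebook it would not be needed.

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the sketch runs without a display
import pandas as pd

nan = float("nan")
dates = pd.date_range("2015-01-04", periods=5, freq="7D")
graph_df = pd.DataFrame({
    "Albany_price3ma": [nan, nan, 1.1, 1.2, 1.3],
    "Boston_price3ma": [nan, nan, 2.1, 2.2, 2.3],
}, index=dates)

# dropna removes the leading NaN rows so the lines start at the left edge;
# figsize is (width, height) in inches, and legend=False hides the legend.
ax = graph_df.dropna().plot(figsize=(8, 5), legend=False)

print(len(ax.get_lines()))
```

The call returns a Matplotlib Axes object, so anything further (titles, labels, a trimmed legend) can be done on ax with regular Matplotlib methods.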

  • Anyway, Um, I think that is a stopping point.

  • We could definitely poke around further.

  • We could probably do this entire series on the avocado dataset, but that might get boring.

  • So in the next tutorial, we are going to visit a new data set with new challenges.

  • So, um, yeah, if you got questions, comments, whatever.

  • Leave them below.

  • Also shout out to my most recent channel members is obey a shower.

  • Carlos Daniel Agar and zoo Key.

  • More one.

  • Thank you guys very much for your support.

  • You guys are amazing.

  • Uh, I will see you guys next time.

  • Questions, comments below.

  • Come chat with us in discord.
