  • COLTON OGDEN: Hello world.

  • This is CS50 on Twitch.

  • My name is Colton Ogden, and today I am joined for the first time by--

  • ANDY CHEN: Andy Chen.

  • Nice to meet y'all.

  • COLTON OGDEN: So Andy here--

  • tell us a little about what you do here

  • on campus, what you're involved in.

  • ANDY CHEN: Sure.

  • So I am a master's student studying bioinformatics.

  • I'm also a special student in computer science,

  • and I actually work at HarvardX, so if you guys

  • are familiar with the online learning platforms of Harvard, that's

  • one of the offices that has really good resources.

  • COLTON OGDEN: I feel like you--

  • didn't I-- I met you in the spring, I think.

  • You came to the fair for the Supreme Court.

  • ANDY CHEN: That's right, yeah.

  • COLTON OGDEN: And I think you were talking

  • about something like that, yeah.

  • Pretty exciting.

  • And what are you going to talk about today?

  • ANDY CHEN: Well today we're going to talk about a programming

  • language called R, and one of the things you

  • can do in it, which is biostatistics.

  • COLTON OGDEN: Oh yeah.

  • ANDY CHEN: Cole, you might ask me what is biostatistics?

  • COLTON OGDEN: What is biostatistics?

  • I actually--

  • ANDY CHEN: So it's really statistics in the field of like biological data.

  • But a lot of people use it in the context of epidemiology,

  • as opposed to more like molecular biology kind of things.

  • And that's actually what we're going to be dealing with today.

  • COLTON OGDEN: That's diagnosing diseases, right?

  • Epidemiology.

  • ANDY CHEN: Epidemiology is sort of the study

  • and the practice of response to the spread of diseases.

  • COLTON OGDEN: Got it, OK.

  • Makes sense.

  • ANDY CHEN: Right.

  • COLTON OGDEN: I clearly don't--

  • I'm not an expert on biology or biostats.

  • ANDY CHEN: But you will be soon.

  • COLTON OGDEN: Yeah, I'm very excited.

  • We have a lot of people in the chat that have joined us,

  • and were talking before we started a little bit in advance.

  • Thank you very much to everybody who's joined.

  • Regulars of ISO TV.

  • There's a new regular, Asley, Newanda33333, belacures,

  • m.kloppenburg, thank you for joining.

  • Let me make sure I didn't miss anybody up above that.

  • Techytack, hello.

  • [INAUDIBLE] and fatma, thank you for joining

  • the regulars and everyone she's saying.

  • Really curious about this one, says m.kloppenburg.

  • This is the first time we've had anything kind of statistics

  • related on stream.

  • ANDY CHEN: Oh, exciting.

  • COLTON OGDEN: Python's obviously a language

  • that's very often used in bio or in stats, generally speaking.

  • But R is kind of like the language that, I think, most people

  • associate-- or at least they associate stats with R, and then also

  • sort of associate stats with Python, too.

  • I don't know anything about R, so I'm actually

  • very curious to see what it looks like, what the environment looks like,

  • what we can do in it.

  • I think we've caught up on all the comments.

  • Everybody's saying hey Andy, nice to meet you Andy, everybody saying,

  • so you got a lot of friends in the chat there.

  • Yeah, so thanks so much everybody.

  • Let's go to your screen here, so we have your screen set up.

  • And why don't you get us started here.

  • ANDY CHEN: Sure, awesome.

  • Thank you very much Colton.

  • Hello everyone, hello friends from all over the world.

  • So R, like Colton was saying, is one of probably two languages

  • that are very popular for statistics or data science kind of things,

  • Python being the other one.

  • Today we're going to be looking at R, which let's go to the website.

  • So bring up a browser if you will.

  • The first thing we're going to be doing is installing the language itself.

  • Now notice that we actually are not going

  • to be working in R, which on Mac OS 10--

  • well actually I don't know what I'm running,

  • but whatever-- on Mac you have to install R the language itself,

  • which actually I think does have a command line interface.

  • But we're going to be working in R Studio, which

  • is an integrated developing--

  • developing environment?

  • COLTON OGDEN: Integrated development environment.

  • ANDY CHEN: Development environment, thank you.

  • COLTON OGDEN: It's a mouthful.

  • ANDY CHEN: It's a mouthful.

  • I'll just keep calling it an IDE.

  • COLTON OGDEN: IDE.

  • That's why we call it an IDE.

  • No one wants to say all those words.

  • ANDY CHEN: But yeah, so we're going to be installing

  • R, which is the language itself, as well as R Studio, which

  • is the IDE in which we'll be working.

  • COLTON OGDEN: What are the links that we can go to,

  • and I can toss them in the chat as well.

  • ANDY CHEN: Great. So the first one is going to be www.r-project.org.

  • COLTON OGDEN: OK.

  • ANDY CHEN: The second one is going to be rstudio.com.

  • And the last one--

  • COLTON OGDEN: The former being the language

  • itself, the latter being the IDE, the R IDE that you're alluding to?

  • ANDY CHEN: Exactly.

  • COLTON OGDEN: OK.

  • And babicnight also in the chat, and Andre Jacob Johnson, and Irenae,

  • thank you very much for joining us, everybody.

  • Well, some more regulars.

  • And babic, to answer your question, not late at all.

  • We just started.

  • We're now tossing in some links into the chat for downloading R and RStudio.

  • So r-project.org and rstudio.com.

  • ANDY CHEN: Thank you.

  • So what we're going to be doing today is working

  • with the NHANES data set, which is actually kind of difficult to--

  • it's freely available.

  • It's US government data, but it's actually hard to parse in its raw format,

  • so what we have provided today is a .text file,

  • which I've uploaded to this link.

  • I don't--

  • COLTON OGDEN: We can make a bitly for it.

  • So what is the--

  • do you want to email me the link, and then I'll toss a bitly into the chat.

  • People can click on it, and then get access to it later on YouTube.

  • ANDY CHEN: Absolutely, yep.

  • And then so while I'm doing that, let's see.

  • Let's get to email.

  • Oh, man [INAUDIBLE] piling up.

  • COLTON OGDEN: If you want I can go and go here,

  • so people can see your personal email.

  • ANDY CHEN: Oh, yeah.

  • Some people like that.

  • COLTON OGDEN: Lots of juicy tidbits in there.

  • Everybody just go ahead and look through Andy's email.

  • Yeah, we'll get to it we'll get a bitly for everybody in the chat,

  • if you want to email that to me.

  • ANDY CHEN: I think it should be sent.

  • COLTON OGDEN: OK, and we refresh.

  • Just one sec, everybody.

  • Sorry for the delay, but this will be a lot better than typing

  • a super long mega upload link.

  • OK, so here I go I got the link.

  • I'm making the link.

  • I'm going to copy it to bitly.

  • If it wants to cooperate.

  • Copy it over to bitly, paste it in there.

  • Get rid of these stupid messages.

  • ANDY CHEN: They're just trying to show you love, Cole.

  • COLTON OGDEN: A little bit.

  • Aw, crap.

  • OK.

  • Here, we're good.

  • We got this.

  • And edit bitly, get a copy.

  • Can I customize it?

  • I can't.

  • So we're going to call this bit.ly/biostats_stream.

  • And so that will be how it works.

  • Clear all these messages, save that.

  • It's going to-- I'm going to copy that.

  • I'm going to go to the chat, and I'm going to paste that in.

  • So now if you go to this bitly url.

  • So it's bit.ly/biostats_stream.

  • And let me-- and Asly says, Andy is such a Hufflepuff with a heart emoji.

  • ANDY CHEN: Thank you.

  • COLTON OGDEN: So bit.ly-- let me make sure this is working--

  • /biostats_stream.

  • Yep, it works perfectly.

  • ANDY CHEN: Good.

  • And if anybody is curious, go to bit.ly if you want to have a really long url,

  • and you want to shorten it down, you can do that at bit.ly.

  • Bitly, as it's called.

  • COLTON OGDEN: Indeed.

  • OK.

  • ANDY CHEN: I think we have a few comments.

  • Nuwanda333, I am a Hufflepuff.

  • Thank you, I think.

  • I think it's a good thing, right?

  • COLTON OGDEN: I think so.

  • I think the Hufflepuffs are--

  • I actually don't know what the--

  • ANDY CHEN: I think they're the catch all.

  • COLTON OGDEN: I think they're the friendly--

  • like I actually honestly don't know too much about it.

  • ANDY CHEN: OK, well I'll take it.

  • COLTON OGDEN: All I know is that they're friendly.

  • ANDY CHEN: Sure.

  • And then in response to TwitchHelloWorld.

  • So I am doing my masters in bioinformatics.

  • COLTON OGDEN: And did you say you're doing something in CS?

  • ANDY CHEN: I'm a special student in the Graduate School

  • of Arts and Sciences, which means I am sort of a visiting

  • student within the university.

  • COLTON OGDEN: Mm.

  • OK got it, got it.

  • Makes sense.

  • Only three megabytes, says facelessvoice, [INAUDIBLE] surprised.

  • The data set is relatively small.

  • ANDY CHEN: Yay.

  • It's just text.

  • COLTON OGDEN: No big data today.

  • This is small data.

  • Sort of.

  • ANDY CHEN: It's approximately 10,000 entries, if I recall.

  • COLTON OGDEN: That's actually pretty sizable.

  • ANDY CHEN: Actually I think it's exactly 10,000 entries.

  • COLTON OGDEN: Oh, wow.

  • Next we'll find the number.

  • ANDY CHEN: We'll find out.

  • COLTON OGDEN: Forces me to install a stupid Chrome add on.

  • ANDY CHEN: Oh, don't do that.

  • It's so-- actually let me see if I can show what it should look like.

  • Do not install the Chrome add on.

  • It's-- here.

  • So can we--

  • OK great.

  • So if you hit it once, and it should, da, da, da--

  • I don't know what's going on.

  • No, no, you don't want this.

  • Do not do this.

  • This is-- OK.

  • I think-- there we go.

  • So you want it actually--

  • COLTON OGDEN: I guess you got to click on it twice.

  • Just don't install the add on.

  • The upload is a little bit shady.

  • ANDY CHEN: It's a little--

  • I wouldn't trust it.

  • Those kiwis-- no, I'm just kidding.

  • COLTON OGDEN: Trying to get some ransomware on everyone's computer

  • today.

  • Make a little extra money on the side.

  • ANDY CHEN: That's actually what they pay me for.

  • It's my real job.

  • But all right, let's get back to R. The first link on r-project.org is

  • downloading the language itself.

  • All right, so what we're going to do is under the Getting Started section

  • we're going to go to Download R, this link right here.

  • We're going to click it.

  • And then, so these are--

  • COLTON OGDEN: Maybe we want to hit Command-plus a couple of times,

  • just so we can see a little bit.

  • It's a little small.

  • ANDY CHEN: Let me see if I can-- yeah.

  • Is that better?

  • COLTON OGDEN: Yeah, this should be pretty good I think.

  • ANDY CHEN: Cool.

  • So CRAN is the Comprehensive R Archive Network,

  • which is a bunch of mirrors where different distributions of R

  • are stored.

  • And so I'm going to go to one that is closest to me.

  • It doesn't really matter, but what, Massachusetts.

  • Probably CMU.

  • Pennsylvania, I think that's the closest one.

  • Well, if we go back.

  • CMU Pittsburgh.

  • Pennsylvania is pretty close.

  • COLTON OGDEN: It's probably, it's pretty close.

  • ANDY CHEN: I mean doesn't really matter.

  • COLTON OGDEN: It doesn't matter too much.

  • If you're-- maybe if you're abroad it might make a little bit more

  • of a difference, a download speed difference.

  • But yeah, choose the mirror most appropriate to you, to your country.

  • We do have a lot of people tuning in from all over the world,

  • which is always super awesome. [? RobertSpiri ?]

  • thank you for joining me.

  • We are doing some biostats in R with Andy Chen.

  • ANDY CHEN: Hello.

  • A newcomer to the stream.

  • And we just got everybody sort of situated with the data set.

  • So if you're not equipped with it yet, you can go to this URL.

  • bit.ly/biostats_stream.

  • ANDY CHEN: That's a great title.

  • COLTON OGDEN: It works.

  • It wasn't taken, thankfully.

  • No one's done a biostat stream.

  • ANDY CHEN: This is the first ever.

  • COLTON OGDEN: First ever in history.

  • ANDY CHEN: Wow, I'm into it.

  • I feel like I should like frame this moment, put it over my bed.

  • COLTON OGDEN: I think you should.

  • But let's get back to the R before I get too distracted.

  • So once we've clicked on an appropriate mirror for you,

  • I am going to download R for Mac OS, because that

  • is the operating system I'm running.

  • And then it's-- I think 3.5.1 should work,

  • but we'll go with an older version, just to be safe.

  • I'm going to do 3.3.3.

  • To download that.

  • So on Mac OS that downloads a .pkg, which I think actually is like sort

  • of a custom installer.

  • So it's currently downloading, and then once that's done downloading,

  • I'm going to click it.

  • I'm going to double click it.

  • There we go.

  • COLTON OGDEN: Go through all the steps.

  • ANDY CHEN: And then just--

  • if you want to be particular about it you should feel welcome to.

  • It-- oh, so one thing that is actually very attractive about R in the industry

  • is it is a commercially-- it's free, open source software.

  • So unlike a lot of commercial statistics software--

  • some industries that do a lot of statistics will prefer certain languages

  • like Stata or SAS, but R is popular in certain industries, and in academia

  • because it's free.

  • COLTON OGDEN: Makes sense.

  • And hence maybe why it's become so popular as of the last few years.

  • ANDY CHEN: Absolutely.

  • COLTON OGDEN: Facelessvoice in the chat is saying, what is biostats?

  • ANDY CHEN: Yeah, that's a really good question.

  • So the way that I'm using the term today is referring to applied statistics

  • with biological data.

  • And we talked about this earlier in the stream, that could technically

  • be used in the context of molecular biological data,

  • but today we're actually going to be looking at epidemiological data.

  • Which to reiterate, is the study and the practice

  • of response to disease and its transmission.

  • COLTON OGDEN: OK.

  • Makes sense.

  • Aren't the majority of languages free, says facelessvoice?

  • ANDY CHEN: So Stata and SAS are not free, I don't think.

  • COLTON OGDEN: Like MATLAB I think has an expensive license.

  • ANDY CHEN: I think so, yeah.

  • COLTON OGDEN: Most languages are free, I would probably say.

  • I think it's also the environment is usually a big part of it, too.

  • ANDY CHEN: That's true.

  • The environment.

  • COLTON OGDEN: But in the context of bio stats it sounds like--

  • in the context of statistics it sounds like there

  • are, relative to maybe the rest of CS, some languages

  • or environments that are not free.

  • ANDY CHEN: That's-- yeah.

  • COLTON OGDEN: That's kind of why making the case here is important.

  • ANDY CHEN: I think that's absolutely true.

  • And as we'll see in a little bit, I--

  • this is still installing.

  • Great.

  • We have time to talk.

  • Ooh, no we don't.

  • Statistics is sort of--

  • the way that you can use R is almost like a giant calculator.

  • Which is different from certain programming languages, which

  • is why it's very popular in certain--

  • in industry and statistics, because you can just plug it in and plug it out

  • without having to think about like scripts or scary CS-y kind of things.

  • COLTON OGDEN: Makes sense.
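
To make the "giant calculator" idea concrete, here is a minimal sketch of what typing straight into the R console can look like. The numbers are made up for illustration and do not come from the stream's data set.

```r
# Arithmetic is evaluated as soon as you press Enter
2 + 3 * 4
sqrt(16)

# Made-up measurements, stored in a vector for illustration only
heights_cm <- c(150, 162, 171, 180)

# Built-in summary functions work on the whole vector at once
average_height <- mean(heights_cm)
spread <- sd(heights_cm)
average_height
spread
```

Each line can be run on its own, with no surrounding script, which is the plug-it-in style Andy describes.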

  • [? Ahmet Osman ?] says, this is an awesome stream.

  • By chance I'm participating in MIT Hacking Medicine.

  • Do you have any advice or recommendations?

  • ANDY CHEN: I actually think I was thinking of applying to that,

  • but I missed the deadline.

  • The last MIT hackathon I went to was a VR/AR AR/VR hackathon?

  • Advice.

  • Are you local?

  • It would be helpful to know if--

  • do you have any specific questions on what kind of advice you would like?

  • Because if not my first thing is just have a lot of fun

  • and make a lot of friends.

  • And get swag.

  • COLTON OGDEN: Also shout out to the invisible can of seltzer.

  • ANDY CHEN: It's the color of my face.

  • COLTON OGDEN: You can put it in front of your-- actually [INAUDIBLE]

  • because all it does, it shows the background of the grass.

  • Because you're already looking at a green background,

  • but if that was for example like a red background,

  • it would probably be a little bit easier to see that it's invisible.

  • [? Twitch Hello World, ?] is it also applicable to epidemiology in terms

  • of treatment and not just spread?

  • ANDY CHEN: To epidemiology in terms of treatment and not just spread.

  • Hm.

  • I suppose you could probably get some interesting data out of--

  • well, actually I'll say it this way.

  • I think a lot of epidemiologists will say that the treatment requires

  • understanding the situation, understanding the context,

  • and the only way to do that, or one of the best tools we have to do that,

  • is through statistical analysis.

  • And so treatment-- in the real world epidemiologists

  • have to face sort of issues of, this might be the best medicine,

  • but is it cost effective?

  • Can we get it there in time?

  • There's a lot of logistics involved.

  • And if you have statistical data about how the disease is spreading,

  • where it's spreading, and what actual demographics are being affected,

  • you can make really good logistical and business decisions

  • that will maximize the medical impact that you do have.

  • And so in terms of treatment, not in the development of treatment,

  • but absolutely in the execution and the decision making of what treatment

  • is probably best.

  • COLTON OGDEN: Makes sense.

  • Biostats is equal to epidemiology plus statistics,

  • definition from [? Vert ?] [? Lu's ?] school.

  • ANDY CHEN: I-- you know what?

  • I like that definition, yeah.

  • I think all definitions need to be a little fluid, because people use them

  • in different ways in different contexts, and they evolve over time.

  • But I like that definition.

  • COLTON OGDEN: Andre, what would you say are the biggest advantages of R

  • over Python?

  • ANDY CHEN: Hmm.

  • I actually am much, much more into Python than I am into R.

  • But the biggest advantage of R over Python?

  • Oh, piping.

  • You can do lots of really cool things with piping, which

  • is like sort of feeding processes.

  • It's kind of hard to explain, but I think we'll talk about it later.
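
As a rough illustration of what "feeding" one step into the next can mean, the sketch below computes the same result twice, once with nested calls and once with a pipe. It assumes R 4.1 or newer, where the native `|>` operator is available; on older versions such as the 3.3.3 release installed in this stream, the `%>%` operator from the magrittr package plays the same role. The numbers are made up.

```r
# Made-up numbers for illustration only
values <- c(4, 9, 16, 25)

# Nested calls: read from the inside out
nested_result <- round(mean(sqrt(values)), 1)

# Piped calls: read left to right, each result feeds the next function
piped_result <- values |> sqrt() |> mean() |> round(1)

nested_result
piped_result
```

Both forms do the same work; the piped version just lists the steps in the order they happen.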

  • And the other thing is, I think R is a little more

  • approachable to people who are--

  • Python is a little more--

  • it's actually less abstract, but it's a little more similar

  • to traditional computer science--

  • like programming language and environments--

  • to the point where I think a lot of people are uncomfortable getting into it.

  • It's like, oh, computer science.

  • Whereas R is very-- the GUI is very much as we'll see--

  • the GUI for RStudio is a pretty familiar environment to work in.

  • You can just use it like a calculator.

  • COLTON OGDEN: That makes sense.

  • Great.

  • ANDY CHEN: So let's--

  • I downloaded.

  • Oh, R is installed.

  • So if I check R, it's installed.

  • Great.
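
One way to confirm an installation from inside R itself, sketched here as an optional check rather than something shown on stream, is to ask R which version is running. The 3.3.3 threshold below simply mirrors the version chosen above.

```r
# Human-readable description of the running R version
R.version.string

# getRversion() returns a version object that can be compared to a string
meets_minimum <- getRversion() >= "3.3.3"
meets_minimum
```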

  • So, well actually let's open that.

  • So R itself does have a command line environment

  • if you want to work in it by itself.

  • But I don't know where it went.

  • COLTON OGDEN: I think it came up and then instantly--

  • ANDY CHEN: It died.

  • COLTON OGDEN: Yeah.

  • ANDY CHEN: Uh-oh.

  • I hope that's not-- that's not even working.

  • OK well, we'll get to that when we get to that.

  • Let's go to RStudio.

  • All right, so again this is the integrated development environment

  • in which we'll be working with the R language, and to install RStudio

  • we're going to go to this link, which is again rstudio.com,

  • going to choose Download under RStudio.

  • So this is RStudio right here, and we're going to choose this link here,

  • which is Download.

  • All right, and we're going to choose the RStudio Desktop open source license,

  • and as Colton was saying earlier, some of the IDEs and certain languages

  • do cost money to use, and RStudio is a common--

  • is a popular one because it's also free, sort of.

  • Depending on your usages, as you'll see here.

  • COLTON OGDEN: Yeah, it looks like they do

  • have different licenses, different commercial licenses and whatnot.

  • They get pretty expensive.

  • ANDY CHEN: Yeah, $30,000 a year.

  • COLTON OGDEN: That is expensive.

  • ANDY CHEN: That's about how much I drop on my boats every month.

  • I wish.

  • COLTON OGDEN: Making that sweet stats money.

  • ANDY CHEN: Although if you call it data science, slap data science on there,

  • you make a lot of money that way.

  • COLTON OGDEN: Yeah, pretty much.

  • ANDY CHEN: All right, so it brings you down to here,

  • and so again I'm running Mac OS.

  • So I'm going to download that.

  • And then this is more of a traditional-- at least

  • I don't know how the distribution is for the other operating systems,

  • but this is more of a traditional Mac type installer.

  • So it's just like, it comes up, and then you drag and drop

  • into your applications folder, and then it bounces an image I think.

  • Once it's done downloading.

  • Great.

  • COLTON OGDEN: Yeah, looks good.

  • The DMG?

  • ANDY CHEN: Yeah, DMG, exactly.

  • COLTON OGDEN: If you're on Windows, there

  • would probably be something somewhere that-- they'll probably

  • have an installer, an MSI that installs the application into some Program Files

  • folder.

  • ANDY CHEN: So that's in there.

  • COLTON OGDEN: But then they make it easy, is what it looks like.

  • ANDY CHEN: Right, yes.

  • COLTON OGDEN: Relatively.

  • ANDY CHEN: Oh, I guess I'll check some other way.

  • Ooh, my secrets.

  • COLTON OGDEN: Well your text messages.

  • ANDY CHEN: Uh-oh.

  • COLTON OGDEN: A deep dive in your text history here.

  • ANDY CHEN: Ruh-ro.

  • COLTON OGDEN: OK, [? Ahmet Osman ?] says I'm Egyptian living in Saudi,

  • and they're coming here and I just got the invitation.

  • Actually today after 12 hours going to start.

  • The advice I'd be interested in is, how could the stream benefit me

  • in the context of health care business.

  • I'm also an MIT Enterprise forum competitor for 2014,

  • and hell yeah it was a lot of fun.

  • ANDY CHEN: Nice.

  • OK, let's see.

  • [INAUDIBLE] I got the invitation [INAUDIBLE] 12 hours.

  • Oh OK, so it's coming soon.

  • What would be [INAUDIBLE] how this [INAUDIBLE] in context of health

  • care business.

  • This is hacking medicine, right?

  • I think unless you have a very specific niche in mind, a topic or a field

  • that you want to go into in health business, probably

  • the single best thing that you can get out of a hackathon

  • like this is just the network.

  • Right?

  • Spend as much time as you can-- well obviously focus

  • on your project, whatever your hack is.

  • But meet cool people.

  • People come there because they're smart, they're passionate, they're driven.

  • And so there are very few opportunities in life

  • where you can kind of have in like what, 24 hours,

  • meet potentially a hundred people who are really interesting, really smart.

  • And if they don't have specifically what you need, they might later on,

  • or maybe they know someone who can.

  • But unless you-- the other thing is, if you

  • have something you are very specific in, I

  • would look up at the list of sponsors, list of speakers,

  • and try to be very strategic about the resources that are available

  • and the people that are there that you can

  • talk to, to try to get into whatever area of health care business.

  • Or health care business you get into.

  • COLTON OGDEN: Great.

  • Good response.

  • Jack [INAUDIBLE] saying it's an executable for Windows,

  • but the download froze at 100 percent.

  • That is unfortunate.

  • I would maybe just try it again, probably.

  • I've had that happen to me on Windows a couple of times.

  • And Chrome.

  • Chrome will say that the download is going,

  • and then when it gets to 100 percent it will just kind of chill for a while.

  • But yeah.

  • That would-- for me as well.

  • Let it be.

  • Sometimes you have to wait it out a little bit,

  • and then it'll save the larger files in Chrome into a--

  • yeah exactly.

  • [? Babbick Night ?] just said it will complete, and then you can install.

  • ANDY CHEN: (SINGING) Let it be, let it be.

  • COLTON OGDEN: Just don't divorce too many ex-wives, that gets expensive

  • says [? Twitch Hello World. ?]

  • ANDY CHEN: Oh, nice.

  • That's good advice, keep it in mind.

  • All right, so speaking of [INAUDIBLE] a few minutes while people catch up,

  • I think.

  • It's-- so I can open this up.

  • So this is what RStudio looks like.

  • COLTON OGDEN: Does it get any bigger by chance?

  • It is a little small.

  • Might be small on the screen.

  • ANDY CHEN: Yeah, I can do that.

  • COLTON OGDEN: Beautiful.

  • ANDY CHEN: Well actually that's real nice.

  • COLTON OGDEN: That's great.

  • ANDY CHEN: Yeah.

  • So RStudio has-- let me actually I don't like this full screen.

  • COLTON OGDEN: If you hit Option-plus, it'll expand without actually

  • going to full screen mode.

  • Or option, click the plus.

  • ANDY CHEN: Way out here?

  • COLTON OGDEN: The green plus, yeah.

  • Hold Option and click that green plus.

  • ANDY CHEN: Oh.

  • COLTON OGDEN: No, this green one up here.

  • No, yeah, that one.

  • ANDY CHEN: Thank you, that's a good-- that's a good hack to know.

  • COLTON OGDEN: I had the same issue with another application.

  • I forgot what it was, and I wasn't having any of the full screen.

  • Oh, and I forgot to shout out all of the people that followed now

  • and before the stream.

  • Let's do that really fast.

  • So we have Notice, and actually it looks like [? Digleen, ?] [INAUDIBLE]

  • [? Conciliated, ?] [? Stadium'91 ?] you [INAUDIBLE].

  • We have Alaska Ukraine, Newtown kings, savage X factor.

  • Tono A 30, and Kate@00.

  • Thank all of you for following today.

  • ANDY CHEN: Yeah, thanks for coming out.

  • COLTON OGDEN: Quite a good number of people.

  • ANDY CHEN: If I had Thanksgiving leftovers, I would share it,

  • but I ate them all.

  • COLTON OGDEN: Oh, yeah.

  • ANDY CHEN: In my tummy.

  • COLTON OGDEN: You did the right thing.

  • What do we see, we have a blackjack counting card.

  • ANDY CHEN: I don't know what that is.

  • COLTON OGDEN: Steve [INAUDIBLE] sent a blackjack counting cards link and see

  • it looks like high-low.

  • High-low be like the classic phrase.

  • Yeah.

  • Cool.

  • ANDY CHEN: Nice.

  • By Robert Springer.

  • COLTON OGDEN: Oh, Robert Springer.

  • Gotta run.

  • Keep up the good work.

  • Thanks Rob for tuning in.

  • Hopefully catch it on YouTube.

  • We'll see you next time.

  • And good, it sounds like Jack Welch got the download working after all.

  • Yeah, sometimes Chrome is weird like that, or whatever browser on Windows.

  • It just takes a couple seconds.

  • ANDY CHEN: To finish up.

  • COLTON OGDEN: Not sure why, but, you know.

  • Such is life.

  • ANDY CHEN: Great.

  • OK, so now we've opened up RStudio, which is our integrated development

  • environment in which we'll be working with R. Great, so

  • let's familiarize ourselves a little bit with how the IDE actually works.

  • So you actually have your console here, terminal if you

  • want to do some command line stuff.

  • ls, cd.

  • COLTON OGDEN: That's like your actual terminal

  • for Mac versus the console being like within RStudio.

  • ANDY CHEN: Correct, yeah.

  • I think it's-- yeah.

  • It's wait.

  • I forget how command line works.

  • Well anyways, it's there if you want to use it.

  • COLTON OGDEN: Beautiful.

  • ANDY CHEN: And then, so over here this is a sort of--

  • and I don't know how to describe this little panel here,

  • but this is where all of your values, all your variables will be stored.

  • Not stored, it will be displayed.

  • So you can actually look at them.

  • I'm going to get rid of that, because you're not

  • supposed to be able to see that yet.

  • The environment is empty.

  • And so there's a little broom brush right here,

  • and it'll let you clear stuff.

  • If for example I had some variables in here that I just cleared,

  • which I just did.

  • Let's say--

  • COLTON OGDEN: They've abstracted it into a brush?

  • ANDY CHEN: Yeah, it's very--

  • COLTON OGDEN: Scoop all the garbage out there.

  • ANDY CHEN: It's very high level.

  • It's not machine code, it's human code.

  • COLTON OGDEN: Yeah.

  • ANDY CHEN: So let's say that I have a very messy console, right?

  • It's messy, and I clean it.

  • Oh look, it's clean.

  • For those of you who like clean consoles, that's a very nifty--

  • COLTON OGDEN: Clear consoles are nice.

  • ANDY CHEN: Clean consoles are very nice.

  • COLTON OGDEN: clear in the terminal as well, shout out to clear.

  • ANDY CHEN: Oh, really?

  • Just clear.

  • Oh, dude, I'm getting all the tips, Colton.
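
The broom buttons have rough command equivalents that can be typed at the console. The sketch below is one way to do it under stated assumptions: the variable names are hypothetical, `rm(list = ls())` removes every visible object from the current environment, and `cat("\014")` sends a form-feed character, which the RStudio console treats as a clear request (other consoles may just print an odd character).

```r
# Hypothetical variables, created only so there is something to clear
sample_count <- 42
sample_labels <- c("a", "b")
ls()                 # lists the objects currently in the environment

# Remove every visible object, similar in spirit to the Environment broom
rm(list = ls())
remaining <- ls()    # nothing was left at the moment this ran
length(remaining)

# Clear the console display in RStudio (same effect as Ctrl+L there)
cat("\014")
```

Note that `rm(list = ls())` is not undoable, so it is worth saving anything you still need before running it.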

  • COLTON OGDEN: We had a Linux command stream.

  • Nick?

  • ANDY CHEN: Yeah, Nick.

  • COLTON OGDEN: Right, yeah we had a lot of juicy tidbits in there.

  • I don't know if Clear was one of the things that--

  • does Python have something similar for the environment window?

  • ANDY CHEN: It depends on what your IDE is.

  • I usually work in Jupyter Notebook, which I think does actually.

  • I just don't know what the command is in Jupyter.

  • COLTON OGDEN: I don't work too much in those.

  • I know PyCharm is--

  • people love PyCharm.

  • I'm not sure if that has something similar as well?

  • ANDY CHEN: Mm, I'm not sure.

  • COLTON OGDEN: That's getting a lot of popularity in the Python community,

  • if it hasn't already for a long time.

  • ANDY CHEN: Right it's--

  • I mean that seems like it's a pretty useful thing to be able to do.

  • So I imagine a lot of popular IDE's have that available.

  • Windows defender scanner-- oh.

  • COLTON OGDEN: Oh, yeah.

  • Yeah, good point [? Jacob ?] about the--

  • or [INAUDIBLE] about the Windows Defender.

  • That would make sense.

  • It just wants to make sure you're not giving everybody viruses.

  • ANDY CHEN: They got me.

  • COLTON OGDEN: Step by step by step, every step of the way.

  • [INAUDIBLE] I'm guessing PyCharm are the ones you talked about.

  • ANDY CHEN: OK.

  • Well actually I think there's a faceless-- no.

  • There's a health interest in patients with learning disabilities and autism.

  • Mm-hmm.

  • That's not my field, but the fact that you have something narrow and specific,

  • that means that you should definitely-- if that's

  • something you want to be working in, look up

  • the list of guests, the list of speakers, list of companies who

  • are supporting, and also maybe if there's a list of other people

  • who are joining, and then try to reach out to them as much as possible.

  • And then brainstorm some ideas.

  • COLTON OGDEN: And [INAUDIBLE], thank you very much for following.

  • ANDY CHEN: Hi.

  • All right so, this is RStudio, and the first thing we're going to be doing

  • is importing the data set.

  • So actually I should probably give a few words on the data set we're working on.

  • COLTON OGDEN: Oh, yeah definitely describe what

  • it is, and the full name of it too.

  • ANDY CHEN: Absolutely.

  • So NHANES, is the National Health and Nutrition Examination

  • Survey, which is a very long running and I

  • think it's an annual survey that covers actually a lot of different things.

  • But it's a CDC study that happens pretty much every year.

  • There's NHANES I, II, and III, which cover different things,

  • but if you explore the website--

  • let's see.

  • About NHANES, blah, blah, blah, blah, blah.

  • You can have access to what the questions are

  • the way that the interviewers actually got

  • this information from their patients, but so there are a lot of things

  • that you can learn about.

  • So anemia, cardiovascular disease, diabetes, environmental exposures,

  • eye diseases, et cetera, et cetera.

  • And yeah.

  • So it's readily available on the CDC website,

  • however the way to actually use the data--

  • so it's readily available, but the way to actually get into the data is

  • a little bit confusing, which is why we're using a pre-rendered .txt,

  • nhanes.txt file, because in R--

  • so the reason it's difficult to parse is, a lot of these data, these files,

  • come out in SAS, which is the language you talked about earlier,

  • in SAS format which we cannot use in R. So that's why we're using this .txt

  • that we have in the mega upload link.

  • COLTON OGDEN: Cool.

  • Makes sense.

  • ANDY CHEN: Sure so--

  • COLTON OGDEN: Do you have a specific field you're interested in applying to,

  • Andy, says [? Twitch Hello World. ?]

  • ANDY CHEN: Oh, hm.

  • Yeah, that's a really good question.

  • I think-- so the work that I support in the lab that I work in

  • is regenerative biology.

  • And so I suppose I would be interested in going into regenerative medicine

  • as a field.

  • COLTON OGDEN: Like stem cells?

  • ANDY CHEN: Yeah.

  • Exactly.

  • It's part of the Harvard Stem Cell Initiative, or Institute.

  • HSCI.

  • A lot of I's.

  • COLTON OGDEN: Get those limbs grown back.

  • ANDY CHEN: It's absolutely.

  • If you guys are interested, you should look up axolotls, A-X-O-L-O-T-L.

  • These are tiger salamanders and Ambystoma salamanders that if you cut

  • their arms off, they grow right back.

  • And I think parts of their hearts and their tails and parts of the brain.

  • COLTON OGDEN: Soon to be human DNA.

  • ANDY CHEN: You heard it here.

  • You heard it here first, folks.

  • COLTON OGDEN: That would be pretty cool.

  • ANDY CHEN: All right, so I think.

  • Nice, bro, thanks.

  • I'd definitely like to friend you.

  • How could I get your contacts?

  • I think if you email one of us, or email Cole later on, he

  • can probably put you in touch.

  • Cool.

  • And then [INAUDIBLE] says I opened it, now what.

  • If you tell me--

  • I'm not sure what you opened, but if you give us a little more information we

  • can probably try to help troubleshoot.

  • COLTON OGDEN: Maybe the R Studio.

  • Maybe they're thinking--

  • ANDY CHEN: Oh, now what we do.

  • Oh, sorry we got distracted by the comments.

  • Let's finish up some of the comments, then

  • we can get into R. Soon to be on wolverines.

  • True.

  • COLTON OGDEN: Yeah, that's true.

  • Then we will-- that's the whole goal, the whole motivation.

  • ANDY CHEN: That's the only reason I'm doing it.

  • Let's be honest.

  • Do you do brain regeneration?

  • I've been helping stroke rehab.

  • Oh, that's really awesome.

  • So I don't personally study any brain regeneration,

  • and actually most of the stuff I do is computational.

  • But there is a lot happening in the field of brain regeneration

  • in zebrafish and axolotls, so there are definitely

  • a lot of faculty professors out there who are studying that.

  • There was a [INAUDIBLE].

  • OK, cool.

  • [? Heard ?] something on edX and I should have instead ted--

  • Oh, TED.

  • Well, I got rep edX because they're right above us.

  • OK, so back to the IDE.

  • So this is what the IDE looks like when you open it.

  • So now what we're going to do is we're going to open the NHANES data set,

  • and the way-- there's actually two or three ways to do that,

  • but in terms of the actual interacting with the GUI

  • there's two ways to do that.

  • So the first one is, under this environment pane

  • right here there is the import data set drop down menu.

  • And so we're going to click that and then

  • we're going to import a data set from text,

  • and then base is the second part of that.

  • And so we're going to navigate to where our NHANES is,

  • and so it should come up with an import data set window that looks like this.

  • And then what we're going to do is we're going to do heading,

  • we're going to check yes just to make it pretty.

  • So if you notice this is what it looks like if heading is yes.

  • The actual heading's in the file itself, and here's what it looks like if it's no.

  • In which case V1 is not nearly as apt as ID.

  • So we're going to do that.

  • And then we're going to import.

  • COLTON OGDEN: Makes sense.
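
The heading checkbox described above corresponds to the `header` argument of R's text readers. The following is a minimal sketch, assuming a tab-delimited file; the two-column file written here is a made-up stand-in, not the real NHANES file.

```r
# Hypothetical stand-in for the tab-delimited NHANES text file.
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tAge", "1\t34", "2\t9"), path)

# Heading = Yes: the first line of the file supplies the column names.
with_header <- read.delim(path, header = TRUE)

# Heading = No: the first line is treated as data and R invents V1, V2, ...
without_header <- read.delim(path, header = FALSE)

names(with_header)
names(without_header)
```

Without the header, the column names also end up as an extra first row of data, which is one more reason to check the box.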

  • ANDY CHEN: Yeah.

  • So there you have it.

  • We actually-- in this window that popped up above our console and terminal,

  • we actually have the data itself visualized in the IDE.

  • COLTON OGDEN: It's effectively sort of turned into Excel.

  • ANDY CHEN: Essentially, yeah.

  • It's a spreadsheet inside of your IDE, which

  • is one of the reasons why R-- well RStudio's popular is

  • you can work your statistics and have access to the spreadsheet

  • that you're working with.

  • Which I think you can probably do in certain R-- or Python IDEs,

  • but most are not I don't think designed to do that.

  • COLTON OGDEN: Makes sense.

  • ANDY CHEN: So let's actually take a walk through the NHANES data set,

  • and just check out what's inside.

  • So ID is the ID of the patient.

  • That's not super interesting to us, but--

  • so we have a survey year.

  • So I think all of these might be 2009, 2010.

  • Yeah, because these are-- it's NHANES is an annual survey, or roughly

  • an annual survey.

  • I think it's two years at a time, but I might be wrong.

  • We also have gender.

  • Well probably a more apt term in common parlance today

  • would be sex, the physical sex of an individual.

  • Age and then bracketed into age decade, which

  • we'll talk about why they did that.

  • We're going to talk about a little bit about the kinds of variables

  • that people use in statistics, because that determines what kind

  • of statistical tests you will run those on when you're trying to figure things

  • out from the analysis.

  • Age months, race, education level, whether or not they're

  • married, their income.

  • So there's a lot of socioeconomics in it, too.

  • Even like the number of rooms in your house.

  • It's kind of an interesting data point to have.

  • Whether they're working, their weights.

  • So socioeconomics, physical characteristics, BMI.

  • So we're getting a little bit more towards some health

  • characteristics, some disease statuses.

  • Sati?

  • Statuses?

  • COLTON OGDEN: I think statuses is--

  • ANDY CHEN: Statuses?

  • COLTON OGDEN: I think if we're doing official Latin it would be statiae,

  • but I think that statuses.

  • ANDY CHEN: Oh, is this supposed to be in Latin?

  • I think I didn't get the memo.

  • COLTON OGDEN: Yeah, the rest of it's going to have to be in Latin.

  • Sorry.

  • ANDY CHEN: Pulse, et cetera.

  • There's actually a lot.

  • So there are-- we'll see exactly-- well actually right here

  • there are 10,000 entries, it says.

  • But I'll also be showing you a line that will tell you how many rows there are.

  • Blood pressure, systolic, diastolic.

  • Testosterone levels, direct cholesterol, the volume of your urine,

  • something I always want to know of course.

  • Diabetes, the days with bad mental health,

  • depression, number of pregnancies, babies, alcohol consumption.

  • There's a lot.

  • So what I'm trying to say is, NHANES is not comprehensive,

  • but it is a very, very wide-breadth data set that you

  • can actually-- if you're interested in learning

  • about parsing this data to look if there are any trends that you

  • want to learn about, it's a really good data set to start with.

  • COLTON OGDEN: Yeah.

  • Looks like there's a lot of fields in there.

  • ANDY CHEN: A lot of fields--

  • marijuana, age of first marijuana.

  • COLTON OGDEN: Definitely, the more information

  • you have is more useful obviously, than having less information.

  • ANDY CHEN: Absolutely.

  • That's what all data scientists will tell you.

  • Hard drugs?

  • Yes please!

  • That's for the consumption of have you ever consumed hard drugs, I imagine.

  • And then some sexual activity.

  • And yeah.

  • So that's all the different things we have here.

  • So this is the NHANES data set.

  • So that's one of the ways to import it.

  • Let's clean this out, right? Let's say that we're starting over.

  • COLTON OGDEN: Asa had a good question.

  • She said the first three IDs are the same.

  • Was the same patient tested thrice on different occasions?

  • ANDY CHEN: Ooh, that's a really good question.

  • Well, let's open it again first.

  • The other way to do it is File, Import Dataset, From Text (base).

  • Great.

  • And again, we want to check heading.

  • All right.

  • That's a really good question.

  • It looks like it's repeated.

  • So if you look at actually all the other data, it is exactly the same.

  • So like for example, the number of rooms is 6, 6, and 6.

  • 9, not working.

  • I think this-- yeah.

  • So the first individual--

  • COLTON OGDEN: It's the same ID too.

  • So maybe you would put this-- like, you would sort of make a set of the data

  • where every ID is different.

  • ANDY CHEN: Right.

  • So that's probably a really good idea.

  • There appears to have been a replication error here.

  • COLTON OGDEN: And asking it's actual people's data, right?

  • L-o-l.

  • ANDY CHEN: Yeah, no, this is publicly available.

  • It's off the CDC website.

  • COLTON OGDEN: I heard that China created genetically modified babies recently.

  • ANDY CHEN: Yeah, like yesterday.

  • They talked about like, two days ago, one of the scientists

  • had someone CRISPR, which is a gene editing technique.

  • One of the, I think, probably like, T cell receptors for an HIV virus

  • out of a--

  • they CRISPRed it out of a baby's genome so that they don't have there--

  • So the way that HIV works is it--

  • the immunologists might get on me, but it's

  • a virus that attacks one of the cells that's critical

  • to the human immune system.

  • My understanding of what happened with that China baby case

  • is the father had HIV or AIDS.

  • And then so the scientist CRISPRed, which

  • is a gene editing [? talent, ?] CRISPRed the gene for one of the cell

  • receptors out of the genome so that the baby's immune system

  • cells can actually-- the HIV has no way of getting inside the cell.

  • COLTON OGDEN: Interesting.

  • They have to do that when the baby is like practically

  • like, just after being a zygote.

  • Because otherwise you'd have to edit--

  • ANDY CHEN: A trillion, like, a billion cells.

  • COLTON OGDEN: So they do it when it's just impregnate--

  • or the woman is just impregnated probably.

  • ANDY CHEN: That's probably very, very early on.

  • Actually, I have no idea how they did it in humans.

  • I don't know if it was in the woman herself, or if it was more like a test tube baby

  • situation.

  • COLTON OGDEN: Yeah, that would make sense.

  • That would make sense.

  • That would be easier.

  • In vitro probably would be really difficult.

  • ANDY CHEN: Yeah, it probably would be.

  • Well, I don't know.

  • I don't [INAUDIBLE].

  • COLTON OGDEN: Like, for the woman, I feel

  • like it could be difficult to be constantly operated on.

  • If there was like, repeated follow up.

  • ANDY CHEN: Right.

  • Plus like, there was like, sort of health questions about--

  • I don't know how you would isolate the baby specifically inside of--

  • COLTON OGDEN: It would be rough.

  • ANDY CHEN: That actually brings up a good point.

  • That's one of the reasons why gene therapies are

  • sort of questionable, is because like, they generally

  • work on the scale of single cells.

  • And so if you're trying to provide gene therapy for an adult,

  • that's a lot of cells that the retrovirus [INAUDIBLE].

  • COLTON OGDEN: If I were a scientist working in that,

  • I have no obviously, information about that or context.

  • But I would imagine it'd be easier just to start from the very beginning

  • and get just the sperm and the egg, and then manipulate those cells.

  • And then those cells would then replicate.

  • And then the therapy that we've provided to the original cell

  • would then propagate to the other cells.

  • ANDY CHEN: Exactly, yeah.

  • COLTON OGDEN: But I'm no expert.

  • ANDY CHEN: Yeah.

  • That's how that would work.

  • It's much easier if you start earlier on.

  • Anyways, I think we have some question.

  • Gattaca becoming real?

  • Gattaca is becoming real.

  • It's true.

  • COLTON OGDEN: I actually don't know what that is.

  • ANDY CHEN: It's a Battlestar Galactica.

  • It's a SyFy series.

  • I think there's like very similar to human robots that are-- or something.

  • The gattacas I think, are the--

  • I mean, I might be getting this wrong.

  • I've never seen it.

  • NHANES is a .txt.

  • COLTON OGDEN: Do we need to hook you up with a power supply probably?

  • ANDY CHEN: 58%?

  • I think I'm OK for now.

  • But I do have one in my pack in case I need to get it at some point.

  • COLTON OGDEN: OK.

  • If you want to continue, we're probably running low on battery.

  • You can keep going and then just give me your power supply,

  • and I'll plug it in for you right now.

  • ANDY CHEN: Sure.

  • It's in my blue bag in the main folded under the jacket.

  • Cool.

  • So now that we have opened [INAUDIBLE] and we

  • can sort of look through the data here, let's just use R as a calculator.

  • Because that is one of the reasons it's popular,

  • RStudio is popular is because it's easy to use.

  • It doesn't look exactly like you're coding hard core.

  • It's easy for-- ooh--

  • COLTON OGDEN: What happened?

  • ANDY CHEN: It's a black screen.

  • COLTON OGDEN: Did your computer go to sleep?

  • ANDY CHEN: No, I don't think so.

  • Well, there we go.

  • Thank you so much.

  • COLTON OGDEN: [INAUDIBLE] I believe.

  • ANDY CHEN: Teamwork makes the dream work.

  • Very nice.

  • COLTON OGDEN: I'm not sure why your screen went black.

  • ANDY CHEN: Yeah, I'm not sure.

  • Maybe it just hates me.

  • COLTON OGDEN: That's probably it.

  • I think you figured it out.

  • ANDY CHEN: You can use it like a giant calculator.

  • For instance, Colton, off the top of your head,

  • what is the product of 933 times 186?

  • COLTON OGDEN: If I got that right, that would be amazing.

  • ANDY CHEN: Yeah.

  • Do it, do it, do it.

  • COLTON OGDEN: 900 times 100 would be-- what would that--

  • That would be 90,000.

  • So I'm guessing like, 126,000 something.

  • ANDY CHEN: Give me some random digits in there.

  • COLTON OGDEN: 1, 2, 6--

  • 1, 2, 6, 1, 4, 4?

  • ANDY CHEN: 1, 2, 6, 1, 4, 4.

  • I mean, in the same order of magnitude, 173,538.

  • COLTON OGDEN: I'm terrible at that kind of math.

  • ANDY CHEN: So I actually sometimes I use R for my homework for my problem

  • sets because if I don't have a pen or paper, I can like, put it in here.

  • I can remember it.
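
As a small illustration of using R as a calculator, the two multiplications from this exchange can be typed straight into the console. The variable names are only there so the results can be referred to afterwards.

```r
# Arithmetic typed at the console is evaluated and printed immediately.
big_product <- 933 * 186
big_product   # 173538

square <- 16 * 16
square        # 256
```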

  • So we're doing this in console.

  • Should I do this in a script?

  • What do you think?

  • COLTON OGDEN: You do what--

  • you multiply 16 by 16.

  • [INTERPOSING VOICES]

  • ANDY CHEN: Can you do it Colton?

  • Hold on.

  • Block the screen.

  • Block the screen.

  • COLTON OGDEN: Is it 100 and-- wait.

  • No, no it's not 196.

  • What is it?

  • Because what's 16 times 6?

  • What's 10 times 6?

  • [INAUDIBLE] So what?

  • 60 plus 96?

  • Would it be 254?

  • ANDY CHEN: 256, dog.

  • You're close.

  • You're close.

  • I did it off the top of my head.

  • COLTON OGDEN: I got--

  • clearly.

  • Yeah.

  • So in the console, we can actually run R commands here.

  • But we can also actually--

  • New Script-- we can run it as a script in the folder up here.

  • So as you notice on the screen, we just open up a new thing right

  • here, a new R script.

  • And so if you want to do 16 times 16--

  • and we so we wrote it.

  • And then we run it, it'll print 16 times 16 down here in your console.

  • COLTON OGDEN: So it's kind of like Python

  • in that you can execute a script line by line.

  • ANDY CHEN: And I think that's because it's interpreted,

  • not because it gets compiled.

  • COLTON OGDEN: Yeah, that makes sense.

  • ANDY CHEN: So for instance, I do my homework here sometimes.

  • So let's say like, a product is 82 times 93.

  • Later on, I'll be like, oh I want to actually know 40 times the product.

  • So I'll be like, oh I don't know, but I can do 40 times product

  • and then run it.

  • I'll have to run from the top.

  • So one thing that you should know about the script

  • is you have to run it from the top.

  • It is not stored in local memory.

  • It needs to run before it happens.

  • If you are used to using Jupyter Notebook, it's really similar to that.

  • So we run.

  • We stored 82 times 93 into a variable called product.

  • And so product is now in memory, which is we talked about it before here where

  • our local variables are.

  • Or I guess-- in the context of this, [INAUDIBLE].

  • COLTON OGDEN: Yeah, just your environment.

  • So whatever your current--

  • ANDY CHEN: Things.

  • I don't know if you would call it a global local variable.

  • COLTON OGDEN: There's like, a frame.

  • I think it's called a frame.

  • And it basically just whatever all the--

  • it's global environment.

  • So I imagine it's got its own global variables in this context?

  • ANDY CHEN: I think so.

  • Yeah, that would make sense.

  • Let's go with that.

  • So let's say that later on my homework, like, oh!

  • Now I need to take 40 times product.

  • I don't remember what product was, but hey, it's stored here.

  • So I can get 40 times product.

  • It's 305,040.

  • Who knew?

  • R did.
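
The homework example above can be sketched as a three-line script. The only point it illustrates is that the assignment has to run before the later line can use `product`; until then the name does not exist in the global environment.

```r
# Run the assignment first; it creates `product` in the global environment.
product <- 82 * 93
product        # 7626

# A later line can reuse the stored value.
forty_times <- 40 * product
forty_times    # 305040
```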

  • COLTON OGDEN: Yeah.

  • R has got the hookup.

  • ANDY CHEN: I got the hookup.

  • COLTON OGDEN: And thanks to Twitch Hello World saying I'm

  • a good sport, [INAUDIBLE] for saying Colton is super smart, though.

  • He just can't multiply.

  • And I'm actually disappointed, because now I

  • realize that we did ask, though, 16 by 16 in the chat.

  • I feel like it's something a programmer should know, just off of their head

  • now.

  • 16 times 16 is 256, right?

  • I feel like that's something that I have to have that memorized.

  • ANDY CHEN: I definitely would not be able to do that yet.

  • COLTON OGDEN: Corrugated Drop, thank you very much for following.

  • We're doing live mental multiplication on stream.

  • It's great.

  • ANDY CHEN: Yeah.

  • All right.

  • So let's see.

  • So let's-- oh actually, so the third way to load a data set is as follows.

  • In your script or in your console-- because they

  • are sort of functionally a [INAUDIBLE] in the sense

  • that whatever commands you write in console you can also do in scripts.

  • The difference is scripts, you can run over and over.

  • It's being stored as like, as a program, as a script.

  • So the other way that we can load a data set

  • is I'm just going to make a variable called NHANES.

  • And I'm going to store it as read.delim.

  • And then wherever the location of my data set is, which

  • is as follows, if I do that, then what has actually been done is I've

  • stored NHANES in a variable called NHANES.

  • So let's try that again.

  • Let's say that my local environment is empty.

  • If I run this command--

  • whoops.

  • It does not like that.

  • That should work.

  • I wonder why it's not working.

  • It's a period here.

  • That should be a--

  • COLTON OGDEN: And it should be Downloads, right?

  • ANDY CHEN: Ooh, good catch.

  • Maybe?

  • Yeah, there we go!

  • And so it pops up.

  • Thank you, Colton.

  • COLTON OGDEN: I try, you know.

  • That's what friends are for.

  • ANDY CHEN: Teamwork.

  • Teamwork brought to you by teamwork, the official drink of teamwork.

  • COLTON OGDEN: Yeah, exactly.

  • ANDY CHEN: So as you'll notice, the NHANES data set pops up on your right,

  • right here.

  • So that's the third way to import a data set.
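
A minimal sketch of this third approach follows, assuming the file is tab-delimited with a header row. The path and the tiny file contents are placeholders; on the stream the real file sat in the Downloads folder.

```r
# Placeholder file so the sketch runs anywhere; in practice nhanes_path
# would point at the downloaded text file instead.
nhanes_path <- tempfile(fileext = ".txt")
writeLines(c("ID\tAge\tGender",
             "1\t34\tmale",
             "2\t9\tfemale"), nhanes_path)

# read.delim() assumes tab-separated values and a header row by default.
NHANES <- read.delim(nhanes_path)

# The data frame now appears in the environment pane under this name.
head(NHANES)
```

A mistyped function name or a wrong folder in the path both produce an error at this step, which matches the two fixes made on screen.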

  • So now that we have our data set here, let's--

  • and so earlier, I was mentioning on how it

  • might be useful to know how big your data set is.

  • And so one of the really easy ways to do that is there's a command called NRow.

  • If we call NRow, and then pass in NHANES as its argument,

  • when the return's done--

  • so I've saved it.

  • I keep forgetting.

  • Now we have to actually run it.

  • In our console it returns-- nrow is called.

  • NHANES is passed in as the argument.

  • And it returns 10,000.

  • COLTON OGDEN: So it's just basically short for number of rows?

  • ANDY CHEN: Number of rows.

  • Yeah.

  • So it's a very good--

  • if you forget like, how big your thing is,

  • it's a very useful command for that.
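
For reference, a short sketch of the row count on a made-up four-row data frame (the real file has 10,000 rows). `ncol()` and `dim()` are the companion functions for columns and for both dimensions at once.

```r
# Tiny illustrative data frame; the column names are assumptions.
NHANES <- data.frame(ID = c(1, 1, 2, 3), Age = c(34, 34, 9, 17))

n_rows <- nrow(NHANES)   # number of rows (individuals)
n_cols <- ncol(NHANES)   # number of columns (variables)
dims   <- dim(NHANES)    # both at once: rows, then columns
```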

  • COLTON OGDEN: Cool.

  • Great

  • ANDY CHEN: All right, so Colton, you were

  • talking earlier about making a subset of data that would be interesting.

  • So I suggest that today.

  • And not a suggestion.

  • This is the only way we're going to do it,

  • because it's the only thing I have notes for.

  • Let's make a subset of the data for just pediatric patients.

  • So conceivably, can you think of a situation

  • where I would like to know a certain data from the NHANES population,

  • but I only care about knowing it in children?

  • COLTON OGDEN: Check for age?

  • So check for age is less than 18?

  • ANDY CHEN: Yeah, exactly.

  • So the way we do that-- whoops.

  • We're going to go back into our script.

  • We're going to take a new variable called NHANES pediatric.

  • And then say we're going to call subset, which is a function that

  • subsets a data set into a new data set.

  • And then it's going to take in the original NHANES as an argument.

  • And then we're going to give it these parameters where the age--

  • so what this line is doing is we're making

  • a new variable called NHANES pediatric.

  • We're going to call subset and pass in the original NHANES data set

  • as an argument, with the caveat that we only

  • want entries, rows, individuals whose age, which, let me open NHANES here--

  • it's actually a column here-- and Age, right here--

  • the value of it is equal to or less than 18.

  • And so that makes someone a pediatric.

  • COLTON OGDEN: OK, makes sense.

  • ANDY CHEN: So let's run it.

  • And oh look!

  • Right here popped up a new data set in our global environment.

  • And so we have successfully subsetted the NHANES data

  • set to the point where we're only looking at pediatric individuals.
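
The subsetting step can be sketched as follows, using a small made-up data frame and assuming the age column is named `Age`, as it appears in the viewer.

```r
# Illustrative data; IDs and ages are invented.
NHANES <- data.frame(ID  = c(1, 1, 2, 3, 4, 5),
                     Age = c(34, 34, 9, 17, 18, 45))

# Keep only rows whose Age is 18 or less.
NHANES_pediatric <- subset(NHANES, Age <= 18)

nrow(NHANES_pediatric)   # 3
```

The second argument to `subset()` is evaluated inside the data frame, which is why `Age` can be written without the `NHANES$` prefix.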

  • COLTON OGDEN: And then Faceless Voice is saying

  • could you remove the duplicate IDs?

  • Do you know off hand how to do that?

  • Is there a set [INAUDIBLE] off of a key?

  • ANDY CHEN: You know, there's a way to do it, but I--

  • let's think about it.

  • OK, so.

  • We could run that again.

  • So where ID is 5, 1 is not equal to--

  • I wonder if that'll work.

  • It'll get rid of the original.

  • COLTON OGDEN: It'll get rid of all of just the 51624s, but that

  • won't actually make it a set.

  • I'm curious if there's like, an R set function.

  • That would be pretty handy.

  • Sets.

  • Make set with order.

  • There's a lot of documentation.

  • Let me see.

  • R set on key.

  • I feel like that would be--

  • there's so many functions called like, set key.

  • What's the purpose of setting a key in data.table?

  • ANDY CHEN: Every time I do a deep dive into the docs, I die a little bit.

  • COLTON OGDEN: How to perform a cumulative sum of the unique IDs only.

  • ANDY CHEN: So Twitch Hello World, is this similar to Excel?

  • There are a lot of things that you could do in Excel that you can do in R.

  • But there are arguably more things you can do in R than you can do in Excel.

  • R is a-- some people call it a statistical language,

  • because it's very useful to perform--

  • it's very easy and useful for statistical analysis.

  • But a lot of things you can do in Excel, but R is more powerful in the sense

  • that it's more low level, and you can implement your own functions.

  • Although you can do that with macros in Excel.

  • But I think it's more powerful, it's more low level.

  • COLTON OGDEN: Yeah they wanted to figure out how

  • to get rid of all the duplicate IDs.

  • Yeah, if you don't know the function offhand,

  • we can maybe forge ahead and then--

  • it looks like it's somewhat hard to--

  • oh, maybe the unique functions [INAUDIBLE] is saying?

  • R unique function.

  • Oh, OK.

  • So what Unique will do is it will actually just

  • get rid of all duplicate rows, which I think will work for our use case.

  • ANDY CHEN: Yeah, I think it does.

  • Yeah.

  • COLTON OGDEN: So just call yeah, I guess, unique first.

  • You'd be like, NHANES unique, because you probably want to do it after the--

  • or I guess you want to on that.

  • Yeah, that works.

  • ANDY CHEN: Yeah, either way, I think it would probably be OK.

  • OK admitted [INAUDIBLE].

  • Let's look at how many rows there are.

  • COLTON OGDEN: Oh, but you probably want to assign it to a variable too, right?

  • ANDY CHEN: Yes.

  • Into itself, you think?

  • COLTON OGDEN: Sure, yeah.

  • ANDY CHEN: So let's run that.

  • Cool.

  • OK.

  • So let's actually--

  • COLTON OGDEN: So then you can print nrow on NHANES pediatric and then

  • before you do the function, yeah.

  • ANDY CHEN: And then that will tell us how many individuals.

  • So we should see at least three fewer.

  • COLTON OGDEN: So thanks Bella, for tossing that in the chat.

  • Magus 503 says there's also duplicated.

  • So I guess that will give us how many are the same?

  • ANDY CHEN: Are the same?

  • That's a good one to know.

  • Thank you.

  • COLTON OGDEN: It looks like it's printing out for some reason.

  • ANDY CHEN: Yeah.

  • But that's because I haven't saved it to a variable.

  • COLTON OGDEN: Oh, I see.

  • Gotcha.

  • ANDY CHEN: So if we did not call unique, then what happens

  • is we have 2,628 entries.

  • And if we do end up calling it, we have 2,246 entries.
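
A minimal sketch of the de-duplication worked out above, on invented data. `unique()` drops rows that repeat an earlier row in every column, and `duplicated()` flags those repeats so they can be counted. The last line shows an alternative, keeping the first row per `ID`, which is an assumption about what one might want rather than what was run on stream.

```r
# Illustrative pediatric subset with one individual repeated three times.
peds <- data.frame(ID  = c(1, 1, 1, 2, 3),
                   Age = c(9, 9, 9, 17, 4))

rows_before <- nrow(peds)            # 5

# Drop rows that are identical to an earlier row in every column.
peds_unique <- unique(peds)
rows_after  <- nrow(peds_unique)     # 3

# Count the repeats directly.
n_repeats <- sum(duplicated(peds))   # 2

# Alternative: keep the first row seen for each ID.
peds_by_id <- peds[!duplicated(peds$ID), ]
```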

  • COLTON OGDEN: OK, nice.

  • So there's 400 duplicates.

  • ANDY CHEN: That's quite a lot.

  • COLTON OGDEN: Duplicates.

  • ANDY CHEN: Duplicitous.

  • What does duplicitous mean?

  • COLTON OGDEN: Good question.

  • I actually don't know.

  • ANDY CHEN: Like serendipitous?

  • COLTON OGDEN: We can--

  • we can use the good old dictionary app.

  • Duplicitous, deceitful.

  • ANDY CHEN: Oh really?

  • Oh!

  • Oh, that makes sense.

  • COLTON OGDEN: Treacherous, duplicitous.

  • ANDY CHEN: Treacherous!

  • COLTON OGDEN: The Vocabulary stream as well.

  • ANDY CHEN: Yeah, we're learning lots of things today.

  • Great.

  • So now we have made a subset with unique individuals.

  • Very nice.

  • OK.

  • COLTON OGDEN: And we did it live.

  • We figured it out live, even better.

  • Even better.

  • Thanks to the Twitch stream for shouting it out for us.

  • They got our back.

  • They always got our back.

  • ANDY CHEN: We always got your back.

  • Oh actually, so this is interesting.

  • In statistics in general, if you have a data set

  • and you have information that is what you think is erroneous,

  • or it's absent-- like, for instance, let's

  • say you have an individual who didn't give an age or a gender or a BMI et

  • cetera, you have to consider is there a reason for why that data was omitted.

  • Because there might be an underlying factor

  • there and that is actually causing that data to be omitted.

  • That's just something that you have to be considerate of when you're

  • performing biostatistical analysis.

  • So for instance, let's say that--

  • so the sum function is interesting.

  • It's going to count how many times the argument occurs.

  • And I'm going to say is.na, which is saying the thing is not a thing.

  • It's confusing, but I'll explain it in a minute.

  • Let's look at NHANES pediatric.

  • And then we use a dollar sign, which is to parse into and say hey,

  • which variable are we looking at?

  • I think most of these are going to be not empty.

  • So blood pressure systolic av.

  • That's some kind of blood pressure thing.

  • So my guess is that there's going to be some empties here, but let's run it.

  • Yep.

  • So sum is it counts how many things happen of the thing that's inside.

  • And what we put inside is, is.na.

  • And what this is saying is like, the thing itself,

  • if it is absent, na, like not applicable.

  • And then what is not applicable?

  • In the data set NHANES pediatric, specifically for BP sys av.

  • So what that means if we're going to visualize it,

  • is it goes into this NHANES data set.

  • It goes into the variable called-- what do we do?

  • Is this [? b sys av ?] or something?

  • COLTON OGDEN: Yeah, it's one of those ones that's kind of abbreviated.

  • [INTERPOSING VOICES]

  • Oh, no, you had BP sys av.

  • ANDY CHEN: BP sys av.

  • Oh, thank you.

  • This one.

  • So some of these have entries in them.

  • But you'll notice that, for instance this one has an na in it.

  • It doesn't have anything.

  • So what I was alluding to earlier is in statistics in general,

  • you should consider why your data set might have an empty there.

  • Just because it's empty doesn't mean that you should omit it,

  • because there might be a very interesting reason and underlying

  • factor for why it's being omitted.

  • We're not going to consider that too much today.

  • But if you go further on into doing biostatistics or statistics in general,

  • you definitely need to put thought into why you're

  • omitting data if you decide to omit.

  • COLTON OGDEN: Makes sense.

  • Osire18 says I'm a biochemist out sick from work.

  • Thank you cssatv for some [INAUDIBLE].

  • ANDY CHEN: Welcome back to the-- so the world of bio stuff.

  • You can't escape it.

  • I hope you feel better.

  • So this line, line 11, is counting up each of these rows.

  • So one, two, each of these individuals, this person, this person, this person,

  • this person.

  • And it's saying in its data set for the variable BP sys av,

  • does it say na in it?

  • If so, add to sum.

  • And then so when we've called it, it returns 1,022.

  • So in our particular data set, our subset

  • of data set for pediatric patients or individuals, 1,022 of them

  • have an empty in them.
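
The counting step can be sketched on a made-up column. `is.na()` returns TRUE for each missing value, and `sum()` treats TRUE as 1 and FALSE as 0, so the sum is the number of missing entries. The column name `BPSysAve` is an assumption about how the blood pressure variable is spelled in the file.

```r
# Illustrative data: two of the four individuals have no blood pressure.
peds <- data.frame(Age      = c(2, 9, 17, 4),
                   BPSysAve = c(NA, 102, 110, NA))

# One TRUE/FALSE per row: is the value missing?
missing_flags <- is.na(peds$BPSysAve)

# Summing the logical vector counts the TRUEs.
n_missing <- sum(missing_flags)   # 2
```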

  • Interesting.

  • I don't know why, but the reason why might actually be important.

  • So something to think about.

  • COLTON OGDEN: And it looks like they use the dollar sign, kind of Excel syntax

  • there too.

  • ANDY CHEN: Oh, is that Excel syntax?

  • COLTON OGDEN: They have like dollar sign for like, the row.

  • ANDY CHEN: Oh, that's right.

  • Yeah, that's right.

  • COLTON OGDEN: And so that kind of reminds me of that.

  • It's nice that you can index into the sheet like that.

  • ANDY CHEN: That's the word to use.

  • Index, yeah.

  • COLTON OGDEN: Fancy CS words here, you know.

  • Indexing.

  • ANDY CHEN: My life is zero indexed.

  • COLTON OGDEN: There you go.

  • ANDY CHEN: How many people are in this room?

  • COLTON OGDEN: Uh, yeah, one.

  • ANDY CHEN: Very nice.

  • OK.

  • So let's actually do some interesting things.

  • So I suggest that we not look at BP sys av,

  • because as enthralling as BP sys av might

  • be, I think it might be easier to look at some more tangible data,

  • some more tangible variables.

  • So let's look at age, because it's easy to understand what age is.

  • Let's also look at gender, or as it's saved in gender.

  • But what we would probably call sex in common parlance--

  • and BMI.

  • So let's run all of those.

  • OK.

  • So none of the patients have age missing.

  • Great.

  • Zero patients here.

  • And then gender, none of the--

  • I should say individuals, not patients necessarily.

  • None of the individuals are missing gender.

  • So each of the individuals in the study either provided

  • or was assigned a gender.

  • COLTON OGDEN: Because it's like, a mandatory field probably

  • for the survey.

  • ANDY CHEN: Exactly.

  • Right.

  • Yeah, that makes a lot of sense.

  • Because I could see how blood pressure might not be

  • the easiest thing to get from a baby.

  • And so that probably would be easy to leave

  • blank, which is what we were talking about before.

  • Is there a reason that this is blank?

  • Maybe it's because it's a baby.

  • Maybe most babies are difficult to get blood pressures from, in which case

  • that's an interesting fact in itself.

  • But gender and age are pretty straightforward to get.

  • And the last thing we're going to look at is BMI, which, is everyone familiar?

  • Should I talk about BMI do you think?

  • COLTON OGDEN: Does it pertain to the--

  • ANDY CHEN: What we're going to do.

  • COLTON OGDEN: If it does, then sure.

  • ANDY CHEN: It's body mass index, I think.

  • It's just it's a measure of someone's health in terms of if they're

  • obese or skinny or overweight.

  • It's not super accurate, but it's been historically used.

  • And so that's what we're going to be looking at.

  • It ranges from a scale I think, 0 to like, 60 or something.

  • But just think about it as a continuous range.

  • That's the most important thing to think about.

  • COLTON OGDEN: It's the percentage of adipose tissue

  • to non-adipose tissue in the body.

  • ANDY CHEN: Oh, is that what it is?

  • COLTON OGDEN: Is it?

  • Body mass index.

  • ANDY CHEN: So I think it's actually less accurate than that.

  • I think it's just a proportion.

  • You take the ratio of your height to your weight,

  • and then it assigns you some index value.

  • COLTON OGDEN: Oh, you're right.

  • It's the ratio of mass versus height.

  • I thought it was a ratio of adipose to non-adipose.

  • ANDY CHEN: That's really specific.

  • COLTON OGDEN: That would be very hard to calculate super easily.

  • So yeah.

  • I guess weight to height makes more sense.

  • So that's easier to calculate.
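
For reference, the standard definition of BMI is weight in kilograms divided by the square of height in meters. A small sketch in R, where the function name `bmi` and the example values are purely illustrative:

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Example: a 70 kg person who is 1.75 m tall.
example_bmi <- bmi(70, 1.75)
print(round(example_bmi, 1))
```

Because the result is a plain number on a continuous scale, it is a natural example of a numerical variable.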

  • And [INAUDIBLE] saying 38 plus 2 is that we're referring

  • to the people in the chat room.

  • No, it was a joke Andy was making about in this physical room, there is one--

  • because get it?

  • Because zero and then one is two in zero indexed.

  • ANDY CHEN: In computer science, most languages

  • are zero indexed, which means they start counting at zero.

  • So you wouldn't say one two people.

  • There's zero one persons.
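
One caveat worth keeping in mind for the rest of the session: R itself is one of the exceptions and counts from one. A tiny sketch with a made-up vector:

```r
people <- c("Colton", "Andy")

first_person <- people[1]   # the first element lives at index 1 in R
nobody       <- people[0]   # index 0 returns an empty vector, not an error

print(first_person)
print(length(nobody))
print(length(people))
```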

  • Let's see, what else?

  • It's giving me error in unique NHAMES.

  • So, it's not NHAMES--

  • NHANES with an N. Maybe that's your error.

  • But not sure.

  • COLTON OGDEN: Yeah, that would make sense.

  • ANDY CHEN: I think that's probably what that is.

  • Should we assign the data to NHANES [INAUDIBLE] line first?

  • I don't know which line they're talking about.

  • COLTON OGDEN: I think [INAUDIBLE] is responding to [? Babbick. ?]

  • ANDY CHEN: Oh.

  • Ah!

  • COLTON OGDEN: And [INAUDIBLE] is also responding to them.

  • I think [? Babbick ?] may have missed the line where you load the data set.

  • Do you want to bring that back up?

  • Just [INAUDIBLE] that line?

  • ANDY CHEN: This one?

  • COLTON OGDEN: The very first line where the very first data set gets

  • loaded using the read.delim function.

  • ANDY CHEN: Right here, yeah.

  • COLTON OGDEN: And then we're assigning that into a variable called NHANES.

  • ANDY CHEN: Correct.

  • That's right.
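
A hedged sketch of that loading step follows. The file used in the stream is not shown here, so this writes a tiny tab-delimited file to a temporary path first; the column names are assumptions for illustration only.

```r
# Create a small tab-delimited file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tAge\tGender\tBMI",
             "1\t4\tmale\t15.2",
             "2\t17\tfemale\tNA"), path)

# read.delim() defaults to a tab separator and a header row,
# and it reads the string "NA" as a missing value.
nhanes <- read.delim(path)
str(nhanes)
```

If the assignment line is skipped, later calls that mention the variable fail with an "object not found" error, which matches what the chat ran into.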

  • COLTON OGDEN: And then Steve is asking what R packages are you using?

  • ANDY CHEN: So we've actually loaded none so far.

  • But we will be using ggplot2 in a little bit to visualize some data.

  • That's probably the most common data visualization package in R.

  • COLTON OGDEN: Are you able to check if there are other columns that

  • have the same sum as BP sys av has?

  • ANDY CHEN: I can conceivably think of writing a script that can do that.

  • If you are interested in that, it might not be very efficient,

  • but the thing that comes off the top of my head is to have a for loop

  • to run that for each of the variables there are,

  • and then to store in some kind of data structure,

  • and then see if they are similar.

  • If you do it in the equivalent of a dictionary,

  • then you would just take the ones that are the same,

  • and then you would take the key and do whatever the key is in it.
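
One possible way to do this in R without writing the loop by hand is sketched below. The data frame and its column names are invented; `colSums()` and `split()` stand in for the loop and the dictionary described above.

```r
# Hypothetical data with missing values scattered across columns.
df <- data.frame(
  Age      = c(1, 2, 3, 4),
  BMI      = c(NA, 16, NA, 18),
  BPSysAve = c(NA, NA, 100, 110),
  Pulse    = c(70, NA, 80, 90)
)

# Number of NAs in every column at once (a named vector).
na_counts <- colSums(is.na(df))

# Group the column names by their NA count; the list keys are the counts.
same_count <- split(names(na_counts), na_counts)

print(na_counts)
print(same_count)
```

Any list entry with more than one name is a set of columns sharing the same number of missing values.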

  • What's the delim thing?

  • The delim thing is actually just the syntax for loading a data set.

  • So if I were to actually import data set and do this whole step

  • from the beginning.

  • It would do the same thing here but I've just

  • manually done it through a command.

  • Great.

  • So the interesting thing here is some of our pediatric individuals, 275 of them

  • actually don't have a BMI.

  • So that's interesting.

  • But again, we're just sort of going to ignore that

  • for the purposes of this demonstration.

  • But in real life, you should think about is there

  • a reason for why these BMIs are empty, because that could actually

  • be an interesting underlying reason.

  • COLTON OGDEN: Makes sense.

  • ANDY CHEN: Before we go too much further into R,

  • we should have a conversation about the kinds of variables in statistics.

  • So what is different between a variable called gender and BMI in terms

  • of what the possible answers could be?

  • COLTON OGDEN: Gender, in this model where

  • we're going based on biological sex, and it's basically zero or one,

  • it's almost like a Boolean, but in this case,

  • it's just a very limited set of options.

  • ANDY CHEN: That's exactly right.

  • COLTON OGDEN: And then whereas the BMI would be a floating point value that

  • could range between 0 and some-- basically, they

  • handle different ranges of potential values.

  • ANDY CHEN: Of potential value.

  • That's exactly right.

  • COLTON OGDEN: In this case actually, one is a different type of data.

  • One's a floating point value versus the other one is an enum.

  • ANDY CHEN: Very nice.

  • The answer I was looking for, which is more of the statistical approach to it,

  • is male versus female are the possible answers for gender,

  • as it's used in this data set.

  • Male and female are the only two possible answers,

  • which means that it, like a Boolean, which can be true or false,

  • it's either this or that.

  • There's a name for this kind of variable in statistics.

  • And it's called a dichotomous variable or a binary variable.

  • COLTON OGDEN: Makes sense.

  • ANDY CHEN: And so that's related to something else called

  • a categorical variable, which is sort of like, what color is this chair?

  • COLTON OGDEN: So that would be an enum, categorical variable.

  • In programming.

  • Like red, blue, yellow, green from a limited set of options.

  • ANDY CHEN: Enum--

  • So those are categorical variables.

  • These are things that don't have numbers in them per se.

  • But BMI, because it is a range of any possible value between 0 and whatever

  • infinity--

  • actually, I don't think BMIs go up that high, but it's a range of BMIs--

  • [INTERPOSING VOICES]

  • Yeah, go get on the treadmill or something.

  • I'm approaching that after Thanksgiving.

  • And so the contrast there is right.

  • These are two very distinct kinds of variables.

  • One of them is number based, and one of them is sort of categorical.

  • And so the way that you perform statistical analysis on these variables

  • depends on what kind of variables they are.

  • And so today if we have time, we'll talk about a few different kinds

  • of situations where the dependent and independent variables

  • are categorical or numerical.

  • And then we'll talk about the different techniques

  • that we can use to analyze that if we have time.

  • COLTON OGDEN: What's the delim thing?

  • ANDY CHEN: Theoretically and practically,

  • does R handle an unlimited amount of info in a data set?

  • I think it's a lot less, like, processor intensive than Excel is.

  • So it's probably exponentially better at handling data than Excel.

  • But unlimited, no, because you'll overflow.

  • --two versus a lot.

  • OK.

  • BMI of space objects.

  • I like it.

  • Very nice.

  • So the first thing we're going to look at in our particular instance

  • is gender is a binary or dichotomous variable.

  • And we want to compare that against a continuous numerical variable like BMI.

  • And so the right tool to use in this particular instance,

  • the reason for the tool is that we have two separate populations, male and female,

  • and then some numerical quantity.

  • The first test that comes to my mind in statistics

  • is probably a T test, which is a comparison of means.

  • And so there are certain requirements, sort of assumptions that using a T test

  • requires.

  • We're not going to talk about them today,

  • but if you're interested in statistics, it's

  • important to think about whether to use a parametric test

  • or a non-parametric test, parametric being usually stronger,

  • but having stronger requirements.

  • At the basis of a lot of statistics is the assumption

  • that data lies on some kind of naturally occurring distribution.

  • A lot of times it's a normal or a Gaussian distribution,

  • like a bell curve.

  • Statistics is really looking at what is the likelihood of the data I'm

  • seeing being true, like, occurring, given

  • that the background distribution in reality

  • is probably something like this.

  • And so-- I lost track of where we are.

  • Oh, T tests!

  • So we're going to use a T test to compare gender and BMI.

  • And so the next step here is let's check to see

  • if there are any non-answers in our pediatric data set for BMI.

  • So we actually did check that.

  • So if you want to check in the opposite direction, this is--

  • remember that is.na is saying it doesn't exist, but exclamation point

  • is.na is saying it does exist.

  • So let's put these two together, run that and run that.

  • So we have 275 individuals that don't have BMI

  • and 1,971 individuals that do have BMI.

  • So if we sum those together--

  • 275 plus 1,971.

  • It returns 2,246.

  • And as we saw earlier above, that is exactly how many individuals we have.

  • So we're just double checking to make sure that everything is good.
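
The same sanity check can be sketched on a small made-up vector. The values are hypothetical; only `is.na()`, `!` and `sum()` matter here.

```r
# Hypothetical BMI values with two missing entries.
bmi <- c(NA, 15.1, 17.8, NA, 21.3)

n_missing <- sum(is.na(bmi))    # entries that are NA
n_present <- sum(!is.na(bmi))   # `!` negates, so entries that are not NA

# The two counts should add up to the total number of entries.
totals_match <- (n_missing + n_present) == length(bmi)
print(c(missing = n_missing, present = n_present))
print(totals_match)
```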

  • COLTON OGDEN: Makes sense.

  • ANDY CHEN: And so the next thing we want to do is--

  • let's make a table.

  • So R is more useful than Excel arguably, because it's

  • very easy to make visualizations, although they're

  • a little bit more confusing to make.

  • So like in Excel, let's say you want to make

  • a table comparing what is the average for males

  • and what's the average for females?

  • You can do it, but you have to go insert plot,

  • and you have to choose the data et cetera.

  • And in R it can be very powerful, because it's just one line to do that.

  • And so the way you would do that is table.

  • And then you would pass in as an argument the data

  • that you're looking at, which is, yeah--

  • and then we're going to look at on the basis of gender,

  • exclude equals false here is a parameter that

  • is saying if there are any empties, then show that there are empties that exist.

  • And so it turns out there are 1,087 females and 1,159 males in your data

  • set.

  • There's nothing here to the right because there are no empties.

  • But if we choose BMI--

  • which there are empties-- and we have this

  • exclude equals false argument, what it's going to show-- ooh.

  • That's not what you want.

  • Oh.

  • The reason this is happening is because BMI is not a binary variable.

  • Table, you should probably only use with binaries.

  • Anyways, age?

  • No, age definitely is full too.

  • COLTON OGDEN: Well, you're doing it off of pediatric data though.

  • So you do have at least 1 to 18.

  • So it's visible on one line.

  • ANDY CHEN: Yeah, that's true.

  • So this is actually interesting too.

  • So the way to interpret this graph here is how many

  • entries are for an individual at least 1-year-old, 2-years-old, 3, 4, 5, 6, 7,

  • up to 18.

  • Because it's again, it's pediatric data.

  • There are 121 individuals who are less than one year of age

  • and 93 who are exactly 18 years old.

  • And so table is a very strong command in that it

  • lets you do that in just one line, whereas to do that in Excel

  • would be a few more steps.
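
A rough sketch of these `table()` calls on invented data follows. It uses `useNA = "ifany"`, the documented way to ask for an NA column only when missing values exist, rather than the exact argument typed in the stream. Binning BMI with `cut()` is an addition of this sketch, one way to keep a continuous variable from producing a column per distinct value.

```r
# Hypothetical pediatric records; column names are assumptions.
ped <- data.frame(
  Gender = c("female", "male", "male", "female", "male"),
  Age    = c(0, 3, 3, 18, 0),
  BMI    = c(NA, 15.5, 16.0, 22.1, NA)
)

# Counts per category; an NA column appears only if NAs are present.
gender_counts <- table(ped$Gender, useNA = "ifany")
age_counts    <- table(ped$Age, useNA = "ifany")

# For a continuous variable, bin first so the table stays readable.
bmi_bins <- table(cut(ped$BMI, breaks = c(0, 18.5, 25, 60)), useNA = "ifany")

print(gender_counts)
print(age_counts)
print(bmi_bins)
```

Assigning the result to a name, as done here, is also how a table can be kept around and printed again later.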

  • COLTON OGDEN: I like how easy it is to sort of lay the data out that way.

  • ANDY CHEN: Absolutely.

  • And so I think some people were asking earlier is this Excel or like,

  • what is the advantage of this over Excel?

  • It can be very powerful.

  • It can be very fast.

  • Just one line like this, and this would take you probably at least 10 times

  • as long I think, in Excel.

  • Maybe if you were really good, like, a little faster.

  • COLTON OGDEN: Yeah I'm not that great.

  • So it would probably take me long.

  • ANDY CHEN: Yeah, me neither.

  • COLTON OGDEN: Steve [INAUDIBLE], is there a way to handle such data in C?

  • Theoretically unlimited data sets?

  • The way that I would imagine would be just stream the data,

  • and then replace the same memory with that data.

  • Because you really have a finite amount of memory

  • where you can store information.

  • So if your goal is to parse information--

  • the same sort of information-- you're going

  • to want to basically have a chunk of memory that you write to,

  • do some operation on that, and then replenish it with new information.

  • And it's similar to what you see in games with object pools and stuff

  • where you use the same objects over and over again.

  • Because if you spawn infinite objects, you'd run out of memory.

  • So that's kind of the same idea.

  • You would keep a limited chunk of memory that you populate with the data

  • that you want.

  • And then you just overwrite that data as you get new input from your stream,

  • basically.
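
The question was about C, but the idea of a small reusable buffer can be sketched in R as well, which keeps the examples in one language. This is an illustration under assumptions: the "stream" is an in-memory text connection with made-up numbers, and the chunk size of two lines is arbitrary.

```r
# A stand-in for a large input stream; a file() connection works the same way.
con <- textConnection(c("15.2", "16.8", "21.0", "18.4", "19.6"))

total <- 0
n     <- 0
repeat {
  chunk <- readLines(con, n = 2)      # only this chunk is held in memory
  if (length(chunk) == 0) break       # nothing left to read
  values <- as.numeric(chunk)
  total  <- total + sum(values)       # fold the chunk into running totals
  n      <- n + length(values)
}
close(con)

running_mean <- total / n
print(running_mean)
```

Only the running totals survive between iterations, so memory use stays flat no matter how long the input is.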

  • We're going to do a stream next week on C, lower C with Nick.

  • So maybe we can cover something similar to that.

  • And Bounty Hunter Ridley, thank you very much for following.

  • ANDY CHEN: So this is the syntax for doing

  • the subset from the regular NHANES into the NHANES pediatric.
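
The exact line from the stream is not reproduced in the transcript, so the following is only a sketch of what such a subset typically looks like. The column name `Age` and the cutoff of 18 are assumptions based on the ages mentioned earlier.

```r
# Hypothetical full data set.
nhanes <- data.frame(ID = 1:5, Age = c(3, 45, 18, 70, 0))

# Keep the rows where Age is at most 18; the empty slot after the comma
# means "all columns".
nhanes_pediatric <- nhanes[nhanes$Age <= 18, ]

# An equivalent spelling: subset(nhanes, Age <= 18)
print(nrow(nhanes_pediatric))
```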

  • All right.

  • So the next step is let's actually do some interesting stuff.

  • So we're going to load our first package.

  • And so we call the library function.

  • And ggplot2 is the package we want to run. Actually,

  • we have to install that first.

  • So install.packages, ggplot2.

  • [INTERPOSING VOICES]

  • COLTON OGDEN: -with the actual script too?

  • Like, you're not doing anything necessarily in advance.

  • You can just call and install the packages?

  • ANDY CHEN: Exactly.

  • COLTON OGDEN: There's probably a way to do that in Python as well.

  • ANDY CHEN: Yeah, I think you can.

  • COLTON OGDEN: --had an occasion to do it.

  • And Osire, thank you for following.

  • You posted the chat earlier.

  • ANDY CHEN: So we're just installing the ggplot2 package, which is--

  • COLTON OGDEN: It's a lot of stuff.

  • ANDY CHEN: It's a lot of stuff.

  • COLTON OGDEN: It's always fun.

  • It's always fun going through like, if you've

  • ever installed a node package or whatever, NPM,

  • and go through like, all the same packages.

  • Because they all have like, a million subpackages.

  • But to go through and see what all the different individual subpackages are.

  • Sometimes you have to do that if you have to debug an old NPM project that

  • has a deprecated function or something, you'll have to do that sometimes.

  • And it can be kind of a pain.

  • ANDY CHEN: That sounds like a lot.

  • COLTON OGDEN: It's fascinating going through and sort of digging your way--

  • sort of following the trail of crumbs back

  • like, as low as you can get in node, which admittedly, is high level.

  • ANDY CHEN: No, it's cool.

  • It's cool, to quote David, looking under the hood to see what's happening there.

  • It's still installing.

  • It's a lot.

  • COLTON OGDEN: It would be so cool if we deduced the cure for a disease

  • in the streams [INAUDIBLE].

  • ANDY CHEN: Yeah, Colton, you got that?

  • [INAUDIBLE]

  • COLTON OGDEN: So no.

  • I don't even know how we would begin to do that.

  • We might have to call in the chat for that one.

  • ANDY CHEN: I think that's on you guys.

  • COLTON OGDEN: Call in the ringers.

  • ANDY CHEN: And girls.

  • COLTON OGDEN: Hello Colton.

  • Welcome to the new [INAUDIBLE].

  • Actually our souls are synchronized working on data set, but on C# project.

  • What are you doing, guys?

  • Just [INAUDIBLE] just genius.

  • Thanks, Goson, for popping in.

  • Welcome to the new-- welcome to Andy, the new friend

  • ANDY CHEN: Hang low, hang loose.

  • COLTON OGDEN: Yeah, that's cool.

  • That's cool that you're doing data stuff in C# as well.

  • Data is just popular in so many environments right now.

  • R and Python are the environments that you typically associate with stats,

  • I feel like.

  • At least the ones that I hear about.

  • But I'm sure that people do it every day in everything.

  • ANDY CHEN: Yeah, like assembly.

  • COLTON OGDEN: That would be terrible.

  • ANDY CHEN: That would be disgusting.

  • [INTERPOSING VOICES]

  • COLTON OGDEN: I would not want to mess with that.

  • It would be fast, but I would not want to mess with that.

  • ANDY CHEN: It will be very fast.

  • I mean, C's probably fast.

  • COLTON OGDEN: Yeah, C is fast.

  • Could you talk more about variables while we wait?

  • ANDY CHEN: Yeah, absolutely.

  • So there are many, many, many subcategories and categories

  • of variables.

  • But in my mind, I distinguish between categorical,

  • which are things that don't have numbers,

  • and continuous variables, which are things that do have numbers.

  • Although you can sometimes have categorical

  • with numbers, in which case--

  • a common example of that is what's called a dummy variable,

  • which is like, sometimes statistics languages aren't

  • super smart in that you can't just tell it hey, compare this and this.

  • It needs to know, hey, you need to compare one and two.

  • And so people will sometimes plug in like male is one and female is two.

  • And so you make "dummy variables" like that.
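
A small sketch of that recoding in R, using made-up values and the 1/2 coding mentioned above. The variable names are hypothetical.

```r
gender <- c("male", "female", "female", "male")

# Manual dummy coding: male is 1, female is 2.
gender_code <- ifelse(gender == "male", 1, 2)

# R's own representation of a categorical variable is a factor, which
# stores integer codes plus labels.
gender_factor <- factor(gender, levels = c("male", "female"))
factor_codes  <- as.integer(gender_factor)

print(gender_code)
print(factor_codes)
```

Most R modeling functions accept factors directly, so the manual step is often unnecessary there, though it is common in other tools.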

  • But really, the distinction between variables

  • is, is this a thing that like, exists in like, a natural number line,

  • or is it really categories?

  • And are you comparing categories to categories?

  • Are you comparing numbers to numbers?

  • Are you comparing numbers to categories?

  • And depending on what you're actually doing,

  • that dictates what kind of statistical test you want to use.

  • And so what I was going to say, right now what we're--

  • I'm a little concerned.

  • This doesn't usually take this long.

  • COLTON OGDEN: You're downloading the entire--

  • ANDY CHEN: Oh God.

  • COLTON OGDEN: --R database.

  • ANDY CHEN: Uh!

  • Anyways, well we'll see where that goes.

  • In this particular instance we want to compare gender, which is a category,

  • it's a binary category actually.

  • So it's a dichotomous variable of two populations, males and females,

  • and a quantitative, a numerical variable of each.

  • So BMI is a number.

  • The average might be like, 15.

  • The average for males might be 15.

  • The average for females might be 16.

  • And then what you're doing in a T test is a comparison of two means.

  • And what it's doing is it's sort of like it's consulting the normal distribution

  • and seeing how likely is the difference between these two numbers?

  • How likely is that difference to actually exist along

  • that normal distribution.

  • And so in statistics commonly, we will use

  • things called alphas, which are sort of correlated with statistical power.

  • But these are numbers arbitrarily chosen that we use as cutoffs for like hey,

  • this is likely to happen.

  • So we'll say it's statistically significant.

  • And this is something that has to be done first.

  • You have to state your alpha before you do your test.

  • Let's say my alpha is 0.05--

  • is a very common alpha, 0.05.

  • What that actually means is in a normal distribution, the area of the curve--

  • so it looks like this-- the area of the curve that a value actually exists

  • is 5% of the total area of the curve.

  • In other words, that's approximately two standard deviations from a mean.
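
The "approximately two standard deviations" figure can be checked directly against the standard normal distribution. A short sketch, assuming a two-sided cutoff with alpha split evenly between the two tails:

```r
alpha <- 0.05

# Cutoff that leaves alpha/2 in each tail of a standard normal curve.
z_crit <- qnorm(1 - alpha / 2)

# Share of the curve within two standard deviations of the mean.
within_two_sd <- pnorm(2) - pnorm(-2)

print(round(z_crit, 2))
print(round(within_two_sd, 3))
```

So the 5% lies in the tails beyond roughly 1.96 standard deviations, which is where "about two" comes from.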

  • And so we're using a T test, which is a specific instance of a statistical test

  • that has the specific function-- at least the way I'm using it--

  • of comparing means between two populations.
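
A hedged sketch of such a comparison in R follows. The data are simulated, not NHANES, so the numbers it prints say nothing about real children; the column names are assumptions.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical BMI values for two groups of 30.
ped <- data.frame(
  Gender = factor(rep(c("female", "male"), each = 30)),
  BMI    = c(rnorm(30, mean = 18, sd = 3), rnorm(30, mean = 18.5, sd = 3))
)

alpha  <- 0.05                              # chosen before running the test
result <- t.test(BMI ~ Gender, data = ped)  # numeric outcome by two-level group

print(result$estimate)            # the two group means
print(result$p.value)
print(result$p.value < alpha)     # TRUE would mean "statistically significant"
```

By default `t.test()` runs Welch's version, which does not assume the two groups have equal variances.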

  • And let's say you have more than two populations, you would use something

  • called an ANOVA, analysis of variance.

  • Is that right?

  • I think that's right.

  • So you use something called an ANOVA, which

  • is analysis of variance, which is another statistical test that compares

  • the averages of multiple populations.

  • So let's say I wasn't comparing males and females.

  • I was comparing elderly, infants, and children.

  • That's three categories.

  • What is the average for elderly?

  • What is the average for children?

  • What is the average for infants? And I'm looking

  • at the statistical significance, seeing if there

  • is a statistical significance between the means for each of these three

  • categories, these three populations.
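
A minimal sketch of a one-way ANOVA in R on simulated data. The group labels follow the example above, while the means and spreads are invented for illustration.

```r
set.seed(2)  # reproducible simulated data

# Hypothetical BMI values for three groups of 20.
groups <- data.frame(
  AgeGroup = factor(rep(c("infant", "child", "elderly"), each = 20)),
  BMI      = c(rnorm(20, mean = 16, sd = 2),
               rnorm(20, mean = 18, sd = 2),
               rnorm(20, mean = 26, sd = 2))
)

# aov() fits the one-way ANOVA; summary() prints the F statistic and p-value.
fit <- aov(BMI ~ AgeGroup, data = groups)
print(summary(fit))
```

A small p-value says that at least one group mean differs; it does not say which one.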

  • COLTON OGDEN: Makes sense.

  • ANDY CHEN: And so depending on what your actual data

  • that you're looking at, what the variables you're looking at are,

  • that dictates which tests you should use.

  • COLTON OGDEN: [INAUDIBLE], thank you very much for following.

  • Elias in the chat, thank you for joining.

  • He says hello.

  • Andy are just practices 10x future speech.

  • I was thinking FedEx, kappa.

  • ANDY CHEN: Hey, FedEx.

  • I like it.

  • COLTON OGDEN: I like the table.

  • Do you save it?

  • Especially helpful tables, do you have it set aside somewhere?

  • ANDY CHEN: Which table?

  • COLTON OGDEN: I think just in general, just as a principle,

  • do you ever save tables that you--

  • ANDY CHEN: Oh!

  • That was a question.

  • So I think you actually can do that.

  • So right here in this environment panel--

  • you can't see it-- but right here where I'm circling,

  • you'll select your data set.

  • So I'm in NHANES now.

  • And I think you can actually save it and export.

  • So we'll save it as nhanes.text.

  • Save.

  • Ooh.

  • COLTON OGDEN: Required.table.

  • ANDY CHEN: Yeah.

  • OK.

  • Well I guess it saves it in R format.

  • But if you open the desktop, yeah, I guess there's an R data set of NHANES.

  • once I don't open that here.

  • Save.

  • Let's open.

  • [INAUDIBLE]

  • In my desktop.

  • NHANES.RData, yes.

  • I don't know where that loaded.
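
The same save and reload can be done from the console instead of the environment panel. A sketch with a made-up data frame and temporary file paths; the stream used the RStudio interface, so these calls are an alternative, not what was clicked.

```r
nhanes <- data.frame(ID = 1:3, Age = c(4, 9, 17))   # hypothetical data

# Save in R's own format, remove the object, then load it back.
rdata_path <- tempfile(fileext = ".RData")
save(nhanes, file = rdata_path)
rm(nhanes)
load(rdata_path)   # restores the object under its original name, nhanes

# Or export as tab-delimited text that other programs can read.
txt_path <- tempfile(fileext = ".txt")
write.table(nhanes, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
round_trip <- read.delim(txt_path)
print(round_trip)
```

`load()` puts the object back into the workspace silently, which may be why it was not obvious where it went.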

  • [INTERPOSING VOICES]

  • COLTON OGDEN: [INAUDIBLE] talking about like, the table

  • function that you called as well?

  • ANDY CHEN: Oh!

  • You can absolutely save that as a variable.

  • So I apologize.

  • I think I misunderstood.

  • COLTON OGDEN: That's all right.

  • ANDY CHEN: Thank you.

  • So table, if I run it like this, is just going to provide it here.

  • But let's say I want to save that later.

  • Table saved is a variable name that we will run that as.

  • And then now it's saved as table saved.

  • And if you want to see it, you just run it again.

  • [INTERPOSING VOICES]

  • COLTON OGDEN: Shout out.

  • David's actually in the chat saying hey, thanks for tuning in with Colton, Andy.

  • Looks like everybody beat us to it.

  • But shout-outs to David.

  • Thanks for joining us, for popping in.

  • Did we miss other comments up there?

  • Beverly's asking are we going to do confidence intervals?

  • ANDY CHEN: Ah, confidence intervals.

  • So that's actually very closely related to some of the summary

  • statistics we'll talk about a little bit.

  • I think for the purposes of this stream, confidence intervals,

  • the way I think about it in the context of a T test

  • is it's your mean plus or minus if it's a 95% confidence interval.

  • It's the sample mean plus or minus two of the sample

  • standard deviations in your sort of curve.

  • And what it is, is it's a measure of how likely the sample mean

  • that you got from your data is actually your population mean.
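
As a hedged refinement of the description above: in practice the 95% interval for a mean is usually built from the standard error, which is the sample standard deviation divided by the square root of the sample size, with a multiplier close to two. A sketch on simulated values:

```r
set.seed(3)
bmi <- rnorm(50, mean = 18, sd = 3)   # hypothetical sample

n      <- length(bmi)
se     <- sd(bmi) / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1)       # multiplier, a little above 2 for n = 50

ci_manual  <- mean(bmi) + c(-1, 1) * t_crit * se
ci_builtin <- t.test(bmi)$conf.int    # the same interval from t.test()

print(ci_manual)
print(ci_builtin)
```

The interval is a statement about the procedure: across many repeated samples, about 95% of intervals built this way would contain the population mean.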

  • And so the whole point of statistics

  • is it's virtually impossible for most scenarios

  • to know the actual summary statistic, for example, a mean of a population.

  • And so the way that we try to approximate that is by taking samples,

  • which is what NHANES is an example of.

  • I don't know what the average age of all individuals in the United States is.

  • I think that would be 330 million surveys.

  • COLTON OGDEN: People to measure, yeah.

  • That would be tough.

  • I'm not sure--

  • ANDY CHEN: That would be very tough.

  • COLTON OGDEN: I mean, I think the census may have that information.

  • ANDY CHEN: That's true.

  • The census might.

  • But well, let's say for all of the seven billion humans on the planet.

  • COLTON OGDEN: That would be tough.

  • ANDY CHEN: That would be very difficult. And so what statisticians do

  • is they take a sample, which is like, a representative sample,

  • which is to say that it actually samples sufficiently

  • from all the different demographics that you're trying to actually analyze.

  • And in probability and statistics, we have

  • this idea of distributions, which is to say that whenever we perform samples--

  • or this is actually true for any events that happen--

  • we kind of assume that the likelihood of seeing a particular outcome,

  • whether that's an average age of 15 or 16 or 17 or 69,

  • you sort of assume that that number occurs on a distribution.

  • And the one that we normally think of is a normal distribution

  • or a Gaussian distribution.

  • And it's a bell curve, which is to say that in general,

  • most of your real events, your real samples are

  • going to occur somewhere close to the middle where it actually is.

  • And so that's sort of like the backbone of this kind of statistics.

  • And I think earlier we were talking about probability,

  • or the 95% confidence interval.

  • The idea of a 95% confidence interval is it's

  • trying to say what is the likelihood that the sample you got, the

  • summary statistics you got from your sample,

  • is actually capturing the actual summary statistic of your population.

  • COLTON OGDEN: More data probably helps in that regard.

  • ANDY CHEN: More data does help.

  • Although depending on which statistician you ask--

  • COLTON OGDEN: And you have to make sure that you're sampling correctly too.

  • ANDY CHEN: Absolutely.

  • It needs to be representative.

  • There are many assumptions in statistics.

  • And so today I wasn't going to talk about all the assumptions,

  • just how to do it in R. But most of the tests I'm talking about

  • are parametric tests, which means they have a lot of requirements.

  • A lot of assumptions that have to be met before you can actually use these tests.

  • So remember, we said we assumed that data follow some kind of distribution.

  • That's not always a safe assumption.

  • And there's something called the central limit

  • theorem, which is this idea that in general, if your sample size is like,

  • 30 or more, it's very likely to follow.

  • You can assume that it follows a normal distribution.
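
A quick simulation can make this concrete. Strictly, the theorem is about the distribution of the sample mean rather than the raw data, and 30 is a rule of thumb. The sketch below draws from a deliberately skewed distribution; the number of repetitions and the choice of distribution are arbitrary.

```r
set.seed(4)

# Take 5000 samples of size 30 from an exponential distribution (mean 1,
# standard deviation 1, strongly right-skewed) and keep each sample's mean.
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))

print(mean(sample_means))   # close to the population mean of 1
print(sd(sample_means))     # close to 1 / sqrt(30)

# hist(sample_means)        # uncomment to see the roughly bell-shaped result
```

Even though single draws are far from normal, the means cluster symmetrically around 1.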

  • COLTON OGDEN: Enrique 8923, thank you very much for joining.

  • Looks like I think it did finish installing.

  • ANDY CHEN: I think it did, yes.

  • COLTON OGDEN: GGplot2?

  • ANDY CHEN: Ggplot2.

  • [INTERPOSING VOICES]

  • COLTON OGDEN: I think it installed more than two.

  • ANDY CHEN: Ggplot3, am I right?

  • Am I right?

  • All right.

  • So let's see if we can actually do some visualization stuff.

  • COLTON OGDEN: The ggplot2, it's meant to make graphs, those sort of things?

  • ANDY CHEN: Correct.

  • COLTON OGDEN: Is gg short for graphing something?

  • ANDY CHEN: I think one of the g's is grammar, graphing grammar, I think,

  • plot?

  • And the two is it's the second iteration of the package.

  • COLTON OGDEN: Makes sense.

  • Makes sense.

  • Sample of data should be diverse says [? Babbick Night. ?] Big data does not

  • equal good data, says [INAUDIBLE].

  • ANDY CHEN: That's true.

  • It needs to be representative, absolutely.

  • For instance, let's say I take a sample that's only in the city of Pittsburgh.

  • I cannot say that whatever my findings are,

  • are applicable to the United States in general.

  • But if it were representative, if it were diverse--

  • if I took 30 people from Seattle, Los Angeles, Pittsburgh, Denver, Houston,

  • then you could make the argument that it is representative,

  • and whatever your findings are can be extrapolated too.

  • COLTON OGDEN: That would be representative

  • of urban environments, but not necessarily more rural environments.

  • That's like, outskirted or remote cities or towns.

  • So that would be something even more to take into consideration.

  • ANDY CHEN: Absolutely true, yeah.

  • COLTON OGDEN: I have seen lots of errors and inaccurate info

  • in health care files and hardships in those patients getting these corrected,

  • though they should be able to.

  • This is an issue that comes up as awareness among biostatisticians.

  • I think it's unfortunate [INAUDIBLE] might come

  • to conclusive results based on such.

  • ANDY CHEN: That's a really good question.

  • I actually don't work in industry.

  • But from what I hear from colleagues who do

  • or who have themselves heard about it, EMR, EHR,

  • which are Electronic Medical Records, Electronic Health Records,

  • one of the issues with them in their implementation in the United States

  • health care system is that they're relatively new.

  • We're in a state of transition from written to electronic health records.

  • And part of the difficulties of that transition

  • are getting health care providers to actually use the system correctly

  • in that manner.

  • And as for whether this is an issue and something that

  • comes up amongst biostatisticians:

  • I think that a good biostatistician is absolutely aware of those things.

  • And he or she tries his or her best to ameliorate and work

  • with the data as much as they can.

  • Or if they come up with some kind of conclusion

  • that they put the caveat, given the limited data, or like,

  • this data can only be extrapolated to whatever that limited representation

  • happens to be.

  • COLTON OGDEN: Makes sense.

  • JL97 finally caught the stream.

  • Thank you for joining us today.

  • Very nice.

  • ANDY CHEN: OK.

  • So ggplot loaded.

  • So an example of a plot that ggplot can make is a histogram.

  • This is actually going to take me a while to type

  • out, because ggplot2 is very powerful, but it's not my favorite.

  • It requires a lot of typing.

  • COLTON OGDEN: A verbose library?

  • ANDY CHEN: It's very, very verbose, correct.

  • COLTON OGDEN: So I get that sense from, like, Matplotlib as well.

  • ANDY CHEN: Matplotlib is like that too.

  • Although Matplotlib is more understandable to me.

  • Like, for example, geom, like geometry?

  • But you know, it could be so many things.

  • COLTON OGDEN: Geometric something or other.

  • ANDY CHEN: But it could be so many things.

  • And this, in a sense is telling me what I should want, a histogram.

  • COLTON OGDEN: Yeah, that is interesting.

  • I'm not sure.

  • You would think it'd be something like style.

  • ANDY CHEN: You would think, yeah.

  • But I don't know.

  • Maybe I'm just not used to it.

  • But--

  • COLTON OGDEN: [? Babbick ?] said, "This reminded me of 6.002x."

  • I'm not sure what that is.

  • Do you know what that is?

  • ANDY CHEN: 6.002x?

  • Uh-uh.

  • COLTON OGDEN: I'm not sure what that is, [? Babbick. ?]

  • ANDY CHEN: Oh, 6.002.

  • Is that an MIT class?

  • COLTON OGDEN: Oh, 6.002x?

  • Yeah, that might be.

  • ANDY CHEN: The online version of it.

  • COLTON OGDEN: Yeah.

  • ANDY CHEN: Course 6 is electrical engineering and

  • computer science.

  • I don't know 6.002, though.

  • COLTON OGDEN: That would make sense.

  • We can use the Google machine.

  • 6.002x course, maybe?

  • Circuits and electronics?

  • Is what it is?

  • MIT 6 point-- yeah, it's circuits and electronics.

  • ANDY CHEN: Cool.

  • COLTON OGDEN: Interesting.

  • ANDY CHEN: 2 plot.

  • Oh, OK.

  • So once you get started we have to call the library.

  • COLTON OGDEN: Ah, interesting.

  • ANDY CHEN: There we go.

  • COLTON OGDEN: OK.

  • Oh, it's like requiring it.

  • ANDY CHEN: Yeah.

  • COLTON OGDEN: Got it.

  • Makes sense.

  • ANDY CHEN: Let's run it.

  • OK.

  • COLTON OGDEN: Oh, nice.

  • ANDY CHEN: We have a graph.

  • COLTON OGDEN: And you have like a dedicated browser to see--

  • ANDY CHEN: This little guy right here.

  • COLTON OGDEN: --on the right side there, we're a little bit--

  • I can hide us briefly.

  • It's going to break everyone's heart but--

  • we don't have a fancy transition off of that either.

  • ANDY CHEN: Oh, are we gone?

  • That's OK.

  • The world doesn't need to see our pretty faces.

  • COLTON OGDEN: "What would we do without Google?" says [INAUDIBLE].

  • ANDY CHEN: Yahoo?

  • Although, I would hate to see the world where we have to use Yahoo, or Bing,

  • god forbid.

  • COLTON OGDEN: "The results being significant is dependent on your alpha.

  • If you're setting your own alpha wouldn't that

  • make it very easy to manipulate the results into being

  • significant, or not significant?"

  • ANDY CHEN: Great question.

  • So, yes.

  • In a lot of scientific fields, psychology, especially I think,

  • is pretty susceptible to this.

  • From what I've been told.

  • I'm not-- I don't have--

  • just from my friends who are in that field.

  • Yes.

  • Essentially, the way the statistics works

  • is it assumes that there is an underlying

  • distribution for sampling events.

  • That like if you flip a coin 1,000 times there is some distribution that it's

  • going to show up in.

  • Like there will be 500 heads, 500 tails.

  • That's a binomial distribution.

  • I forget, that might be a Bernoulli distribution.

  • But the assumption is that in nature there

  • are some kinds of distributions that actually exist.

  • If you perform some kind of event so many times

  • it will follow a general pattern.

  • The more you perform, the smoother the pattern will be.

  • And so a lot of the kind of statistics that we're doing today

  • is we're assuming they follow a normal, or a Gaussian distribution.

  • And so, yes, you're absolutely right.

  • The alpha that we set is arbitrary.

  • So if you think about it in that terms, let's say,

  • simplistically, one journal only publishes

  • articles that are just t-tests, like a single t-test.

  • If our alpha is .05, that means that 5% of the time, or 1 in every 20

  • of these articles, one of your answers is going to be incorrect.

  • It's going to be-- you think it's statistically significant,

  • but it's actually happening 1 out of every 20 times.

  • So yes, statistics is like that.

  • We're assuming a lot about the natural world,

  • that it follows certain rules, certain distributions of events.

  • And we're applying statistics to them to sort of quantify

  • the likelihood of something being true, or not true.

  • But that's statistics.
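
The 1-in-20 point can be checked with a small simulation. This is only a sketch with made-up numbers: every simulated "study" draws both groups from the same normal distribution, so the null hypothesis is true by construction. The sample size of 30, the mean of 20, the standard deviation of 4, and the 2,000 repetitions are all illustrative assumptions, not values from the stream.

```r
# Simulate many studies in which the null hypothesis is TRUE:
# both groups come from the same normal distribution.
set.seed(50)        # arbitrary seed, only so the run is repeatable
alpha <- 0.05
n_studies <- 2000   # hypothetical number of single-t-test articles

p_values <- replicate(n_studies, {
  group_a <- rnorm(30, mean = 20, sd = 4)
  group_b <- rnorm(30, mean = 20, sd = 4)
  t.test(group_a, group_b, var.equal = TRUE)$p.value
})

# Share of studies that look "significant" although there is no real
# difference. It should land close to alpha.
false_positive_rate <- mean(p_values < alpha)
print(false_positive_rate)
```

Raising or lowering `alpha` moves this rate with it, which is why the threshold should be fixed before looking at the data.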

  • COLTON OGDEN: [INAUDIBLE] is saying, "I was promoting the stream to a Columbia

  • professor, and he said, R was for statistics,

  • and Python for DNA comparison."

  • ANDY CHEN: Yeah, actually, I would agree with that to some degree.

  • I do most of my bioinformatics work in Python.

  • Although Professor Rafael Irizarry, who is a bioinformatics

  • professor at the T.H. Chan School, the Harvard School of Public Health,

  • does a lot of his work, I think, in R.

  • So you can conceivably do either in either.

  • COLTON OGDEN: The saying is, different strokes for different folks. Right?

  • ANDY CHEN: That's true.

  • Different strokes for different folks.

  • COLTON OGDEN: I just realized that this chat is a little small.

  • Because I had to shrink it down yesterday.

  • Let's make it a little bit bigger.

  • There we go.

  • ANDY CHEN: Cool.

  • Cool.

  • COLTON OGDEN: That looks nicer.

  • ANDY CHEN: It does look nice.

  • And so do you.

  • COLTON OGDEN: Thank you.

  • I appreciate that.

  • You look amazing.

  • Look at you.

  • ANDY CHEN: Ah.

  • OK.

  • So we've made a plot.

  • And so, the one thing that's interesting about-- so we're assuming--

  • this is actually an interesting plot because we

  • were talking about Gaussian plots earlier, very normally distributed.

  • This is not a very normal plot.

  • That might be OK, depending on what we want to do with this data.

  • But I want to talk about some characteristics.

  • It's very peaked, and there's a term in statistics

  • used for plots that are-- so let's say a normal plot looks like this.

  • If it looks like this, like a very large peak, we call that leptokurtic.

  • And if it's very flat--

  • if it looks like--

  • so let's say this is a normal curve, if it looks like, boop, just like that--

  • we call that platykurtic.

  • And this also is--

  • it gets like sort of narrow on the right side.

  • Let me see if you can't see my head.

  • Oh, dude.

  • What are these?

  • We call that right skew.

  • COLTON OGDEN: What I should do is just, whoop.

  • ANDY CHEN: Oh, look at that.

  • COLTON OGDEN: And then just gives us a little shrink.

  • We did this on the first stream, actually.

  • ANDY CHEN: Shrinking boys.

  • COLTON OGDEN: Use a little shrink down.

  • ANDY CHEN: A little shrinking shrink.

  • COLTON OGDEN: I gave it a little shrink down.

  • ANDY CHEN: I'm into it.

  • COLTON OGDEN: Something like that.

  • ANDY CHEN: I just got really small.

  • Yeah.

  • So this is a right skewed curve, excuse me.

  • And if the tail were on the other side, it would be a left skew.

  • And so these are just terms to describe the shape of a distribution.
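
These shape terms can also be put into numbers. The sketch below writes the moment-based skewness and excess kurtosis out by hand in base R; the helper names `skewness` and `excess_kurtosis` are hypothetical (not from the stream or from base R), and the three sample vectors are simulated purely for illustration.

```r
# Moment-based shape measures, written out by hand.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3   # 0 for a normal curve
}

set.seed(50)
right_skewed <- rexp(5000)        # long tail on the right
flat         <- runif(5000)       # flatter than a normal curve
peaked       <- rt(5000, df = 5)  # sharper peak, heavier tails

print(skewness(right_skewed))     # positive: right skew
print(excess_kurtosis(flat))      # negative: platykurtic
print(excess_kurtosis(peaked))    # positive: leptokurtic
```

A negative skewness would indicate a left skew, with the tail on the left side.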

  • COLTON OGDEN: OK.

  • Makes sense.

  • ANDY CHEN: Oh, a histogram.

  • We should definitely explain that.

  • I apologize.

  • A histogram is a certain kind of chart, a certain kind of diagram,

  • that shows exactly like--

  • is that [? Babbick? ?] [? Babbick ?] is saying--

  • COLTON OGDEN: [INAUDIBLE].

  • ANDY CHEN: Is it the frequency?

  • Yeah. [? Babbick ?] is saying, it's showing

  • the frequency for a certain thing.

  • So it's actually-- it's sort of uni-dimensional.

  • It's not like a regular plot that has an x- and y-axis,

  • where an x is an independent variable, and a y is a dependent variable.

  • A histogram only has an x, in the sense that, the only thing

  • that's like data is the x--

  • it's the BMI.

  • And then the y-axis is showing frequency--

  • the number of times that this particular value occurs.

  • So for instance, a BMI between 0 and, this looks like it might be 15,

  • occurs maybe 240 times.

  • Whereas a BMI between 16 and looks like 25 occurs probably like 1700 times.

  • So that's what a histogram is.
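
To see the same idea as numbers rather than bars, here is a minimal base R sketch. The BMI values are simulated from a right-skewed gamma distribution as a stand-in (the real NHANES values are not reproduced here), and the bin width of 10 is an assumption for illustration.

```r
set.seed(50)
# Made-up BMI values: mean around 20, with a tail to the right.
bmi <- rgamma(2000, shape = 16, rate = 0.8)

# Bin edges every 10 units, wide enough to cover the largest value.
breaks <- seq(0, 10 * ceiling(max(bmi) / 10), by = 10)
h <- hist(bmi, breaks = breaks, plot = FALSE)

# One row per bin: the BMI range and how many individuals fall into it.
bin_table <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)
```

The `count` column is exactly what the y-axis of the histogram draws, and the counts always add up to the number of observations.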

  • COLTON OGDEN: That would make sense.

  • [? Babbick, ?] is this BMI histogram skewed left?

  • ANDY CHEN: It's skewed right.

  • COLTON OGDEN: Oh, OK.

  • ANDY CHEN: I'm pretty sure it's skewed right.

  • The skew, like, the tail, the part that's slimmer,

  • I think that's the direction you use to describe if it's left or right.

  • COLTON OGDEN: Interesting.

  • Then it's going to be the right skew then.

  • Yeah.

  • Because nobody over 40, there's like very few people over like 40 to 60 BMI.

  • Because at that point, that's very--

  • ANDY CHEN: Right.

  • Because that's-- yeah, that might not be humanly possible to be there,

  • to have that kind of BMI.

  • OK.

  • So now that we've visualized our data, and this is generally good stuff.

  • There are more visualizations that you should do to test your assumptions.

  • Let's actually try to perform this little test.

  • We're trying to perform a t-test, which is comparing the means, the mean BMI

  • values between two populations, males and females,

  • in our pediatric subset of NHANES.

  • And so, the way that we do that--

  • oh, let's overwrite some.

  • So we noticed that there are some data entries that have empty BMIs.

  • Wait.

  • This is not letting me get in here.

  • COLTON OGDEN: [? Magnus ?] is saying, Is there somebody that's around 50?

  • It looks like there is in the data set.

  • It's just very small.

  • Right?

  • ANDY CHEN: It might be.

  • Yeah.

  • There might be a super--

  • COLTON OGDEN: Because it looks like the red line goes all the way up

  • to 60 something, 65, 63?

  • ANDY CHEN: Yeah that's really, really high.

  • Yeah, I think that's an individual, and someone who's

  • under 18, a pediatric individual who has a very high BMI.

  • There we go.

  • All right, let's make some space.

  • Great.

  • So we noticed earlier, probably 20 minutes ago I guess,

  • in the stream, some of our individuals' BMI values were missing.

  • And again, I'm going to reiterate that this is very important in statistics:

  • be sure to consider and at least explain why you

  • decided it's reasonable to omit data.

  • And I'm just going to hand wave over that

  • for the express purposes of demonstrating

  • how to perform a t-test in R. You should, you know, explain to yourself,

  • be able to explain to anyone you're showing your results to why you've

  • omitted this data.

  • So the way that I'm going to do that is I'm

  • going to overwrite NHANES pediatric.

  • Again, call the subset function with NHANES itself.

  • This time with not is.na.

  • Remember, is.na means, that the thing itself is missing, is na.

  • But not is.na is, it is not missing the thing.

  • For gender as well as for BMI.
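
A minimal sketch of that filtering step follows. The tiny data frame is a made-up stand-in for NHANES; only the column names `Age`, `Gender`, and `BMI` follow the NHANES variables, and treating "pediatric" as under 18 is an assumption taken from the discussion above.

```r
# A small stand-in for the NHANES data frame (values are invented).
nhanes <- data.frame(
  Age    = c(5, 12, 17, 30, 9, 15),
  Gender = c("male", "female", NA, "male", "female", "male"),
  BMI    = c(16.2, NA, 21.5, 27.0, 18.4, 22.1)
)

# Keep pediatric rows where neither Gender nor BMI is missing.
# is.na(x) is TRUE when x is missing, so !is.na(x) is TRUE when it is present.
nhanes_pediatric <- subset(nhanes, Age < 18 & !is.na(Gender) & !is.na(BMI))
print(nhanes_pediatric)
```

Comparing `nrow(nhanes)` with `nrow(nhanes_pediatric)` is a quick way to report how many rows were dropped, which helps when justifying the omission.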

  • COLTON OGDEN: Andre is asking, "If the data were not

  • grouped into batches of 10 would it not fit a Gaussian curve?"

  • ANDY CHEN: Oh, that's a very good question.

  • So this, I set the binwidth as 10.

  • I can-- so right here in line 24, I can set that to be 1 if I would like.

  • It's an arbitrary binwidth.

  • And so that shows a much clearer resolution.

  • But a better way to think about this is, a Gaussian curve is actually not

  • discrete.

  • Discrete is a word which means sort of individual numbers.

  • It's a continuous, it's a smooth curve.

  • So a Gaussian distribution does not actually have individual rows.

  • It just has, choo, all of the things in a single curve.

  • And so to demonstrate that, I chose--

  • I had arbitrarily chosen binwidth of 10.

  • But we can choose 1.

  • We can choose 0.5.

  • We can choose whatever we want.
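
As a sketch of how the bin width changes only the resolution and not the data, the block below draws the same simulated values twice with ggplot2. The data frame `bmi_data` and its values are invented for illustration, and the plotting is skipped if the ggplot2 package is not installed.

```r
set.seed(50)
# Made-up BMI values standing in for the NHANES column.
bmi_data <- data.frame(BMI = rgamma(2000, shape = 16, rate = 0.8))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Same data, two resolutions: only binwidth differs.
  coarse <- ggplot(bmi_data, aes(x = BMI)) + geom_histogram(binwidth = 10)
  fine   <- ggplot(bmi_data, aes(x = BMI)) + geom_histogram(binwidth = 1)

  print(coarse)
  print(fine)
}
```

A smaller bin width gives a smoother-looking outline, but the bars stay discrete counts; a Gaussian curve is the continuous idealization they may or may not resemble.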

  • So in line 26, what we've done is we have resubsetted our data,

  • and this time we've removed any individuals that

  • are missing entries for gender, or BMI.

  • And so now that we've done that, let's perform some actual statistics.

  • We're getting there.

  • COLTON OGDEN: Here we go.

  • Here we go.

  • ANDY CHEN: All right.

  • Thank you so much for bearing with us.

  • COLTON OGDEN: Asymptotically approaching, as David would say.

  • ANDY CHEN: Absolutely.

  • Oh, yeah.

  • OK.

  • So let's save-- so tapply is a function that

  • lets us take in our data set as an argument,

  • specifically looking at the BMI variable.

  • And we will also compare that to the same data set,

  • but this time with gender.

  • And then the last argument in tapply is going

  • to be what actual thing you want from it.

  • And the first thing we want to do is, I want to take its minimum.

  • And then-- so I'm actually going to do that a bunch but with max, mean--

  • oops, won't do that.

  • Whoops, there we go.

  • Mean-- mean-- standard deviations--

  • and then we'll just run all those.

  • Run.

  • Run.

  • Run.

  • Run.

  • So these are all saved.

  • We haven't seen them in our console because we saved them to a variable.

  • But something that is cool that I'm about to do

  • is, we can actually cbind all of these things into a single table.

  • Which again, is a very useful one line command in R

  • that might be difficult to do in Excel.

  • You can probably do it very smartly in Python, though.

  • So we're going to make a new summary table variable.

  • And we're going to data.frame and save as a data frame.

  • cbind of min, max, mean, and SD, which are the variables we just made.

  • And then we're also going to do summary table again, just to display it.

  • So line 33 is going to save all those things as summary table.

  • And then line 34 is going to run it and print it.

  • And there you go.

  • We have a very beautiful, very easy to read, table.

  • For females, the minimum value is 12.88 for BMI.

  • For males, the minimum value is 12.89.

  • The mean is 20.49 for females.

  • The mean is 20.05 for males.

  • And so we have our basic summary statistics.
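
The summary-table steps can be sketched as follows. The data are simulated stand-ins (100 made-up individuals per group), and the variable names `bmi_min`, `bmi_max`, `bmi_mean`, `bmi_sd`, and `summary_table` are illustrative choices rather than the exact names typed on stream.

```r
set.seed(50)
# Made-up stand-in for the cleaned pediatric data.
nhanes_pediatric <- data.frame(
  Gender = rep(c("female", "male"), each = 100),
  BMI    = c(rnorm(100, mean = 20.5, sd = 4), rnorm(100, mean = 20.0, sd = 4))
)

# tapply(values, groups, fun): apply fun to BMI separately within each Gender.
bmi_min  <- tapply(nhanes_pediatric$BMI, nhanes_pediatric$Gender, min)
bmi_max  <- tapply(nhanes_pediatric$BMI, nhanes_pediatric$Gender, max)
bmi_mean <- tapply(nhanes_pediatric$BMI, nhanes_pediatric$Gender, mean)
bmi_sd   <- tapply(nhanes_pediatric$BMI, nhanes_pediatric$Gender, sd)

# cbind lines the four results up as columns, one row per group.
summary_table <- data.frame(
  cbind(min = bmi_min, max = bmi_max, mean = bmi_mean, sd = bmi_sd)
)
print(summary_table)
```

Because each `tapply` call returns one value per group in the same order, `cbind` can glue them together without any extra matching step.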

  • COLTON OGDEN: Interesting.

  • ANDY CHEN: Now, looking at this Colton, and knowing

  • the sample size is approximately 1,000 individuals for males and females,

  • do you think there is a significant difference between the average BMI

  • for males versus females in this pediatric data set?

  • COLTON OGDEN: Substantial?

  • ANDY CHEN: Significant.

  • COLTON OGDEN: I don't know if I would describe it as being significant.

  • Would you describe it as being significant?

  • ANDY CHEN: Well, that's exactly the answer I wanted.

  • You can't tell.

  • So these are summary statistics, which tell you a mean

  • or median, which is a measure of center,

  • which is saying sort of like, what is, like, the middle kind of value.

  • Whereas, standard deviation is a measure of dispersion, which tells you

  • how spread apart is your data.

  • But it doesn't actually tell you anything about if these two

  • populations are similar, or dissimilar.

  • And so to do that we would have to use a statistical test.

  • In this case, since we're comparing a binary variable

  • with a continuous numerical variable we have to use a t-test.

  • And so the syntax for that--

  • COLTON OGDEN: And also Fatma, in the chat,

  • said, "Can R be added to sandbox.cs50.io?"

  • And from what I see here, I went on to sandbox.cs50.io, and it is already--

  • we do have an R option.

  • Probably don't have RStudio available.

  • But the R command line environment we do have.

  • It looks like we do have a sandbox setup for that already.

  • ANDY CHEN: Great.

  • Nice.

  • So if you want to teach yourself how to use the command line,

  • that's a great skill to have too.

  • "Are we going to regression?"

  • I don't think we will hit regression today.

  • We are going to do a t-test.

  • If we have time, we'll do an ANOVA.

  • Actually, if we have time, we'll do a linear regression.

  • But yeah, we'll see how far we get today.

  • So t-test, let's see--

  • so again, when we look at p values, right?

  • We determine if something is significant, or insignificant.

  • In statistics, before we actually do the analysis,

  • we have to come up with something called H-naught, which is a null hypothesis.

  • And something called h sub a or an alternative hypothesis.

  • So in statistics, the default is you assume that there

  • is no significant difference.

  • Right?

  • That is the default. So the null hypothesis here

  • is there is no significant difference between the mean BMI for female

  • versus males in this sample population.

  • That is the null hypothesis.

  • The alternative hypothesis is there is a significant difference

  • between the average BMI for males and females in this population.

  • And so those are sort of the underlying statements that you're working with.

  • And that's what your p value for your t-test

  • actually tells you to either accept, or reject, the null hypothesis.

  • And so we'll get to that in one minute.

  • Let me actually perform the t-test.

  • The syntax for that is t.test BMI.

  • So the first thing that goes into the function call

  • is the continuous variable that you're comparing--

  • or the categorical variable that you're comparing.

  • Sorry-- the continuous variable that we're comparing-- the numbers,

  • the thing that actually has a mean, or an average.

  • And we're going to compare against the categorical.

  • In our case, the males versus female.

  • In this case, it's a dichotomous variable,

  • which is an instance of a [? categorical ?] variable.

  • Let me do gender.

  • We have to tell it that the data we're using is our NHANES data set.

  • And we're going to set var.equal to TRUE, which assumes the variances are equal.

  • So we could talk about that in a little bit,

  • but there are actually a few different kinds

  • of t-tests where you can assume that the variances are equal, or not equal.

  • And the reason that they're different is the statistical power is--

  • the way it's implemented is actually slightly different.

  • In this case, we're assuming that for the most part,

  • male and female participants in NHANES are, for the most part,

  • very similar except for their sex.

  • I should probably put this in my script so I don't lose it.

  • So let's run that.
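
The call described here can be sketched like this. The data are simulated stand-ins for the pediatric subset, and the alpha of 0.05 is the conventional threshold mentioned earlier, chosen here as an assumption.

```r
set.seed(50)
# Made-up stand-in for the cleaned pediatric data.
nhanes_pediatric <- data.frame(
  Gender = factor(rep(c("female", "male"), each = 100)),
  BMI    = c(rnorm(100, mean = 20.5, sd = 4), rnorm(100, mean = 20.0, sd = 4))
)

# Continuous variable on the left of the tilde, two-level grouping on the right.
# var.equal = TRUE requests the classic pooled-variance two-sample t-test.
result <- t.test(BMI ~ Gender, data = nhanes_pediatric, var.equal = TRUE)
print(result)

# Compare the p value with an alpha that was fixed in advance.
alpha <- 0.05
if (result$p.value < alpha) {
  message("Reject the null hypothesis of equal mean BMI.")
} else {
  message("Fail to reject the null hypothesis of equal mean BMI.")
}
```

Leaving `var.equal` at its default of `FALSE` would run Welch's version of the test instead, which does not assume equal variances.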

  • And here's what happens.

  • We get this output-- two sample t-test.

  • So remember, that's another kind of--

  • there are different kinds of t-tests.

  • One of them is two samples, which is you're

  • comparing two different populations.

  • You can also compare a single, one sample t-tests,

  • in which it's something sort of like a before and after.

  • Because otherwise, it's exactly the same except for the one test condition

  • that you've tested.

  • We look at the p value here--

  • 0.08402.

  • Now if we had chosen an alpha of 0.05,

  • it is greater than our alpha of 0.05, which

  • means we have to accept the null hypothesis, which,

  • Colton, suggests that this data is significant, or insignificant?

  • COLTON OGDEN: Well, repeat the question one more time.

  • ANDY CHEN: The null hypothesis is that there

  • is no significant difference between the mean BMIs of males versus females.

  • COLTON OGDEN: And if it's greater than the alpha of 0.05

  • then that is the case?

  • ANDY CHEN: Well, you fail to reject.

  • COLTON OGDEN: Which means then, that we have to say

  • that there is no big major significant.

  • Right?

  • Am I wrong?

  • ANDY CHEN: No.

  • You're correct in your intuition.

  • But I just want to be--

  • in statistics, people are very careful.

  • You don't really-- you fail to reject the null hypothesis.

  • COLTON OGDEN: Right.

  • So we can't-- we're not asserting either, or the other,

  • but we aren't rejecting the null hypothesis--

  • ANDY CHEN: Absolutely.

  • COLTON OGDEN: --in this instance.

  • ANDY CHEN: Exactly.

  • COLTON OGDEN: OK.

  • We're not establishing the hypothesis as being true,

  • the null hypothesis as being true.

  • Or the opposite of the null hypothesis as being true.

  • ANDY CHEN: Exactly.

  • COLTON OGDEN: Right.

  • OK.

  • ANDY CHEN: So let's actually assume that instead of saying 0.08, it said 0.03.

  • Now what would happen?

  • COLTON OGDEN: 0.013?

  • ANDY CHEN: Just less than 0.05.

  • COLTON OGDEN: Then we have to say that we can--

  • what is the opposite of--

  • ANDY CHEN: You reject the null hypothesis.

  • COLTON OGDEN: Reject the null hypothesis.

  • ANDY CHEN: So you don't really accept the alternative hypothesis,

  • but-- because you can't really say for certain, or most statisticians

  • would hesitate to say for certain, oh, there is a significant difference.

  • Well, I guess that is what you would say.

  • But a sort of more careful, more conservative,

  • way of saying that is like, oh, we just

  • reject the null hypothesis.

  • COLTON OGDEN: OK.

  • So we're basically avoiding making any assertions either way.

  • ANDY CHEN: Pretty much.

  • Yeah.

  • COLTON OGDEN: OK.

  • At which point can we assert either situation, either case?

  • ANDY CHEN: In scientific writing, when people use statistical tests, if the p

  • value is less than your alpha value, what you'll usually say is,

  • we found significant results.

  • COLTON OGDEN: OK.

  • ANDY CHEN: Although a statistician might be a little more conservative about how

  • they would say that.

  • COLTON OGDEN: OK So this has to kind of undergo repeated testing,

  • and repeated samples, and repeated sort of accepting, or rejecting of a null

  • hypothesis over the course of time before we can affirmatively

  • say one way or the other?

  • ANDY CHEN: Ideally.

  • COLTON OGDEN: And even then, it's probably still not 100%.

  • ANDY CHEN: Ideally.

  • Yeah, that doesn't usually happen, but yeah, you're

  • absolutely on the right track in your intuition of how the statistics works.

  • COLTON OGDEN: OK.

  • That makes sense.

  • ANDY CHEN: But yeah.

  • Let's see if there's any--

  • COLTON OGDEN: Fatma was saying that they were having an internal server error.

  • So the-- it looks like sandbox is, at least in America,

  • we can see that it's up, but it's experiencing updates

  • so it's offline, technically.

  • But the landing page is still up.

  • So I think we're updating it probably for the hackathon, or whatnot.

  • It should be up at some point in the near future.

  • Can you tell the difference between one sample and two sample t-tests

  • again, [INAUDIBLE].

  • ANDY CHEN: Sure.

  • Let's actually look it up so I am not--

  • one sample, versus two sample t-test.

  • So a two sample t-test has two populations

  • that are different in some sense.

  • So males and females are two separate groups in this NHANES data set.

  • Whereas a one sample t-test is--

  • ah!

  • Sorry, I actually-- so I misunderstood--

  • I misexplained how--

  • I misexplained one sample t-tests in the beginning.

  • A two sample t-test is comparing if two populations, two sample populations,

  • have a significant difference in their average for whatever variable

  • you're looking at.

  • A one sample t-test is comparing a known true population mean, some value,

  • to the sample you're seeing, and seeing if there's

  • a statistical difference there.

  • So I'm trying to think off the top of my head if there's a good--

  • let's say that we know that the average graduation--

  • or let's say that we know that the population of a city is 500,000.

  • We just know this for a fact.

  • We would perform a one sample t-test if we're

  • trying to compare if it's statistically significant that one of our samples

  • gives us 455,000 instead of 500,000.

  • And then in that particular instance you would use a one sample t-test.
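
A one sample t-test in R uses the same `t.test` function with a `mu` argument for the known value. The sketch below swaps in a different, made-up scenario from the one above: a reference mean BMI of 20 that is assumed to be known, and 40 simulated measurements. Both numbers are illustrative assumptions.

```r
set.seed(50)
known_mean <- 20                             # assumed "known" population mean
sample_bmi <- rnorm(40, mean = 21, sd = 4)   # made-up sample of 40 values

# One vector of data plus mu: is the sample mean different from known_mean?
one_sample <- t.test(sample_bmi, mu = known_mean)
print(one_sample)
```

The null hypothesis here is that the true mean behind the sample equals `known_mean`; there is no second group involved.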

  • COLTON OGDEN: Cool.

  • Cool.

  • Makes sense?

  • ANDY CHEN: Yep.

  • Cool.

  • So I think, unless there are a lot of questions on t-tests,

  • we can actually probably go on to ANOVA.

  • COLTON OGDEN: Sure.

  • Let's do it.

  • ANDY CHEN: Sure.

  • So we talked earlier about different kinds of variables.

  • T-tests are useful for when you are comparing the difference of means

  • between two populations.

  • But let's assume, for instance, that you have a categorical variable that

  • has multiple categories--

  • young, old, medium aged, young adult, et cetera.

  • And you still want to compare an average of some kind

  • of continuous variable between them.

  • So what is the average height of a child, average height of an infant,

  • average height of an adult, average height of an elderly person,

  • and you're trying to look to see if there's

  • a statistically significant difference between those populations.

  • In that particular instance, you would use a test called an ANOVA,

  • or an analysis of variance.

  • So there are actually a lot of assumptions for each of these tests

  • that we're using, and I just want to reiterate that.

  • There are-- you should look at your data, see if there are outliers,

  • and see if they meet the parametric assumptions of each of these tests

  • that you're using.

  • All these tests we're using here today are parametric.

  • If not, you can use something called, non-parametric tests,

  • which are sort of similar in what they do, but are probably

  • not as statistically powerful.

  • But everything we're doing today is parametric.

  • So we're just going to assume that we've met all

  • these assumptions to use these tests.

  • So ANOVA-- let's look at ANOVA.

  • Let's make a variable called, ANOVA BMI with race.

  • And so the category of race in our data set is called

  • Race1, and it's looking at--

  • I guess the surveys have, do the individuals identify

  • as white, or as black, or Mexican, or other, or Hispanic, or et cetera.

  • And so, as we can tell, these are not numbers, these are categories

  • and there's more than two of them.

  • And let's compare that to BMI again.

  • Just because that's what we've been using.

  • So in this particular instance, the statistics question we're asking

  • is, is there a statistically significant difference

  • in the average BMI between these ethnicities, or between these races?

  • I think is the term they use in this data set.

  • And so the way that we would do that is we would call aov,

  • is the syntax in R. aov of BMI, which is your continuous variable.

  • And then, squiggly, the tilde, Race1, which again,

  • is the name of our variable for race.

  • Oh, and the second argument is to identify your data.

  • We're going to be looking at our NHANES pediatric that we made.

  • And then we are going to call--

  • OK.

  • So let's actually do that.

  • We run this function and it saves your ANOVA results

  • into a variable called ANOVA BMI race.

  • Now to actually access it, we do summary of ANOVA BMI race, right here.

  • And so if we run that, it prints out our [INAUDIBLE] results.

  • Excuse me.

  • So if you are interested in ANOVA, a lot of these are--

  • some of this data is actually-- some of these numbers

  • are actually very important.

  • The f value, for instance, is a very useful thing.

  • But we're just going to be looking at the statistical significance for right

  • now.

  • Our p value is 0.00781 star, star, which means its alpha is 0.005.

  • That value is less than 0.005.

  • No.

  • That's not true.

  • Well, regardless.

  • If we had set our alpha as 0.05, this value is less than that.

  • Which means we would say, we reject the null hypothesis.

  • And so in this particular instance, we would

  • have set the null hypothesis before, and it

  • would be the null hypothesis is that there

  • is no significant difference in the mean BMIs between categories of race--

  • white, Mexican, other, whatever the categories had.

  • And so because our p value is less than 0.05,

  • we would reject our null hypothesis,

  • which means that we would probably say something along the lines of,

  • we find that there is a significant difference in the BMI

  • means across these categories.
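
The ANOVA steps can be sketched as follows. The data are simulated, with five made-up groups of 60 individuals and invented group means; only the column names `Race1` and `BMI` and the five category labels follow the data set discussed on stream.

```r
set.seed(50)
# Made-up stand-in: five categories, 60 simulated individuals each.
group_means <- c(21, 20.5, 20.5, 19, 20)
nhanes_pediatric <- data.frame(
  Race1 = factor(rep(c("Black", "Hispanic", "Mexican", "Other", "White"),
                     each = 60)),
  BMI   = rnorm(300, mean = rep(group_means, each = 60), sd = 4)
)

# Continuous outcome ~ categorical predictor with more than two levels.
anova_bmi_race <- aov(BMI ~ Race1, data = nhanes_pediatric)
anova_table <- summary(anova_bmi_race)
print(anova_table)

# The overall p value sits in the first row of the "Pr(>F)" column.
p_value <- anova_table[[1]][["Pr(>F)"]][1]
print(p_value)
```

In the printed table, the stars next to the p value are R's significance codes: `**` marks a p value below 0.01, and `*` marks one below 0.05.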

  • So Colton, can you see why this might be more ambiguous than a t-test, which

  • only has two categories?

  • COLTON OGDEN: Well, I mean, there's just so many races.

  • I don't know if that's part of the issue.

  • ANDY CHEN: Right.

  • Absolutely.

  • If I find--

  • COLTON OGDEN: And also people can subjectively

  • identify as multiple different races a lot of the time.

  • Because people can have parents of different races.

  • ANDY CHEN: Absolutely true.

  • Yeah.

  • From a statistical standpoint, I think the first note is now

  • that I know it's significant, where is the significance?

  • Is it between white and black?

  • Is it between Mexican and white?

  • Is it between other and white?

  • It's ambiguous.

  • COLTON OGDEN: And you have to take probably smaller samples,

  • separate samples, and see how they compare against each other.

  • Many different sort of permutations.

  • ANDY CHEN: Actually, yeah, that's absolutely right, actually.

  • So that actually goes a little more low level into what I'm about to show.

  • But so in ANOVA, because your p value is ambiguous, because your--

  • I was just reading one of the comments.

  • COLTON OGDEN: The last one?

  • ANDY CHEN: Yeah.

  • COLTON OGDEN: Would you have just a general statistics

  • review approaching AP testing time?

  • So informative.

  • Maybe.

  • I don't know when AP testing time is.

  • ANDY CHEN: I don't know when that is.

  • But yeah, perhaps we could maybe do that.

  • We're talking about-- oh, right.

  • It's ambiguous.

  • We don't know where the statistically significant difference is.

  • We don't know between which categories, or between all categories is.

  • So we do something called, post hoc, Latin for, after the fact tests.

  • One of which we'll show.

  • There are different kinds of tests that you use in different circumstances

  • but we're going to use one called Tukey's post hoc test.

  • So the way we do that is, TukeyHSD of ANOVA, or of the variable we just made.

  • And so if we run that, we find the printouts of p values of the categories

  • compared to each other.

  • So you were actually saying you run subsets of--

  • basically what this is doing, is it's very similar to running t-tests

  • within all the possible categories.

  • COLTON OGDEN: Right.

  • OK.

  • That makes sense.

  • ANDY CHEN: And so the reason it's p adjusted,

  • is because when you do that, you're running a lot of comparisons at once.

  • And so you actually sort of inflate your chance of a false positive.

  • And so Tukey's test is one implementation

  • of this kind of sequential sub partitioning.

  • And so what this gives us is, we look at the differences--

  • is the average BMI of Hispanic versus black individuals significantly

  • different?

  • The p value here is 0.60.

  • It's not significant at all.

  • And so most of these are not.

  • But if we look at, other with black, then what this is saying is,

  • the null hypothesis here, like the special sub null hypothesis here,

  • is that there is no significant difference

  • between the average BMI for individuals who are other

  • compared to individuals who are black.

  • In this case, our p value is 0.03, which is less than our alpha of 0.05.

  • In which case we would say, we reject our null hypothesis,

  • that there is not a significant difference.

  • And so you would say, we find a significant difference

  • between the mean BMI for individuals who identify as other compared

  • to those who identify as black.

  • And so it shows you all the possible category differences.

  • So it looks like there are five categories?

  • 1, 2, 3, 4, 5.

  • Yeah, Hispanic, black, Mexican, other, and white,

  • and so it's looking at all the possible permutations between,

  • and looking at if those specific two categories are

  • significantly different enough.

  • And so that's how you would analyze an ANOVA.
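The ANOVA followed by Tukey's post hoc test described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated rather than taken from NHANES, and the names `bmi_data`, `Race`, `BMI`, and `anova_fit` are hypothetical stand-ins for whatever was typed on stream.

```r
# Simulated stand-in for the data used on stream (NOT the real NHANES data).
set.seed(50)
groups <- c("Black", "Hispanic", "Mexican", "Other", "White")
bmi_data <- data.frame(
  Race = factor(rep(groups, each = 40)),
  BMI  = rnorm(200, mean = rep(c(27, 27, 27.5, 24, 26.5), each = 40), sd = 4)
)

# One-way ANOVA: is the mean BMI the same across all five groups?
anova_fit <- aov(BMI ~ Race, data = bmi_data)
print(summary(anova_fit))

# Tukey's HSD: every pairwise comparison, with p values adjusted
# for the fact that ten comparisons are made at once.
tukey <- TukeyHSD(anova_fit)
print(tukey)

# The "p adj" column holds the adjusted p value for each pair;
# keep only the pairs below an alpha of 0.05.
pairwise <- tukey$Race
significant <- pairwise[pairwise[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```

With five categories there are ten possible pairs, which is why the printout has ten rows.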

  • COLTON OGDEN: Get a little bit more granular.

  • ANDY CHEN: Getting a little bit more low level under the hood as David

  • might say.

  • OK.

  • I think we actually have time for a linear regression.

  • I think that will probably be the last thing we talk about though.

  • COLTON OGDEN: OK.

  • Sounds good.

  • Let's look at it.

  • ANDY CHEN: All right.

  • Oh, this is a lot of code to type.

  • OK.

  • So again, a very, very good statistical habit to get into

  • is visualizing your data before you do any analysis, just

  • to understand what's going on if you have a lot of outliers, et cetera.

  • And so what we're going to do, is we're going

  • to visualize the NHANES pediatric data set that we made of course.

  • X equals age, y equals height, plus geom_point.

  • Let's run that.

  • All right, so what this did is it's printing out,

  • in the lower right corner here,

  • all of the individuals of various heights in our pediatric data.

  • So from 0 to eight-- or sorry, ages from 0 to 18, and their heights.

  • And so you'll notice that it has a weird distribution where it's very, you know,

  • blocked.

  • And the reason for that is because we don't

  • think about age in terms of continuous.

  • I'm not 24 and 3/4 years old.

  • Right?

  • I identify as 24.

  • And so that's why each of these only is the height

  • for a 1-year-old, 2-year-old, 3-year-old, 4-year-old, not a

  • 2.48566-year-old.

  • And that's why it has that distribution like that.
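A rough sketch of the scatter plot being described, assuming the ggplot2 package is installed. The data frame `peds` and its columns `Age` and `Height` are simulated, hypothetical stand-ins for the pediatric data set made on stream.

```r
library(ggplot2)

# Simulated stand-in for the pediatric data: whole-number ages from 0 to 18.
set.seed(50)
age <- sample(0:18, 300, replace = TRUE)
peds <- data.frame(
  Age    = age,
  Height = 75 + 5.5 * age + rnorm(300, sd = 8)  # made-up heights in cm
)

# Scatter plot of height against age. Because age is recorded in whole
# years, the points stack into vertical columns, one column per age.
p <- ggplot(peds, aes(x = Age, y = Height)) +
  geom_point()
print(p)
```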

  • COLTON OGDEN: Right that's supposed to be a continuous graph.

  • ANDY CHEN: Exactly.

  • That's exactly right.

  • And so now that we see this, Colton, if I

  • were to give this data to you, do you see--

  • do you think that there is a general trend,

  • a correlation between age and height?

  • COLTON OGDEN: It looks like a small one.

  • As you get older, you tend to get taller.

  • ANDY CHEN: Yeah.

  • Right.

  • And so intuitively that makes sense to us.

  • I chose this example because it makes sense that the older you get,

  • the taller you get.

  • COLTON OGDEN: And also range in height tends to grow as well.

  • ANDY CHEN: That's true.

  • That's actually, really, yeah.

  • That's really interesting.

  • I hadn't noticed that.

  • I mean that makes a lot of sense.

  • Right?

  • Because like we all start about the same size, give or take.

  • COLTON OGDEN: Yeah.

  • People grow at different rates, different sizes.

  • ANDY CHEN: Absolutely.

  • Oh, one thing I forgot to mention is--

  • COLTON OGDEN: Oh, sorry, [? Babbick, ?] let me go ahead and move it.

  • Move it right over--

  • I'll make it smaller again, how's that?

  • Let's go up to here.

  • Very tiny chat so we can see the graph just while we're talking about it.

  • ANDY CHEN: So linear regression is an instance of what--

  • is a specific instance when we're comparing

  • a continuous quantitative variable with another quantitative variable.

  • Age is a number from 0 to 18.

  • In theory, you could actually have this like-- well,

  • I have them as discrete numbers 1, 2, 3, 4, 5, 6 but you

  • can have it as like 1.5, 2.8, whatever.

  • And height is also a continuous number, from 0 to 175 centimeters.

  • And so when you're trying to compare two continuous variables, two

  • quantitative variables, you use linear regression.

  • It's actually really similar to an ANOVA, except that the independent variable

  • in an ANOVA is categorical instead of continuous.

  • Right.

  • OK.

  • So, so the way that we would perform linear regression-- oh,

  • so I think there tends to be--

  • it looks like there might be a linear relationship here.

  • And there might be a better model that actually describes it.

  • But for our purposes let's perform a linear regression.

  • Because I think someone was suggesting we do a regression.

  • So ggplot.

  • It's the same exact thing, but we'll add some lines.

  • It's starting to get really unwieldy, which

  • is one of the reasons I don't like ggplot, is I have to--

  • look at what I'm about to write here.

  • geom_smooth, method = lm.

  • lm is linear model.

  • se = TRUE.

  • I don't know what that is.

  • fullrange = FALSE.

  • COLTON OGDEN: And also, [? Fatma ?] thank you for the kind words.

  • ANDY CHEN: 0 +--

  • COLTON OGDEN: And [? Babbick ?] as well.

  • ANDY CHEN: 0.95.

  • There we go.

  • Run.

  • Cool.

  • So we have the graph we had before, and we've actually fitted a model to it.

  • A linear regression.

  • A line model.
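The call being typed here probably looks something like the sketch below, again assuming ggplot2 is installed and using simulated data (`peds`, `Age`, and `Height` are hypothetical names). The arguments shown are real `geom_smooth` parameters; `se = TRUE` is the one that draws the shaded confidence band around the line.

```r
library(ggplot2)

# Simulated stand-in for the pediatric data (not the real NHANES values).
set.seed(50)
age <- sample(0:18, 300, replace = TRUE)
peds <- data.frame(Age = age, Height = 75 + 5.5 * age + rnorm(300, sd = 8))

p <- ggplot(peds, aes(x = Age, y = Height)) +
  geom_point() +
  geom_smooth(method = "lm",      # fit a straight line (linear model)
              formula = y ~ x,    # height modeled as a function of age
              se = TRUE,          # draw the confidence band around the line
              fullrange = FALSE,  # draw the line only across the observed ages
              level = 0.95)       # confidence level used for the band
print(p)
```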

  • So recall that a linear regression makes

  • sense in terms of-- these are all models. Statistics

  • is all about modeling.

  • If you have a categorical variable, how can you make a line with that?

  • Right?

  • There's no axes to make lines on.

  • But if you have two continuous variables,

  • you absolutely can make a line.

  • Which is why a linear regression makes sense

  • if you're comparing a continuous with another continuous variable.

  • So I just wanted to plot out what it actually

  • looks like in this particular-- for when you're

  • comparing a continuous variable with a continuous variable and it's a linear regression.

  • But to actually perform it in R, you perform the following.

  • We're going to make a variable called, linear regression.

  • And the syntax is lm, which stands for linear model.

  • We're going to do our--

  • Oh, did I do weight?

  • Ah, that's fine.

  • Our independent variable, and then squiggly enyay thing, with our--

  • sorry, dependent variable.

  • And then our independent variable on the right.

  • Data is again NHANES pediatric.

  • And, yeah.

  • So we'll run that.

  • And then we'll call summary, which is a function that gives you

  • the summary for certain kinds of data.

  • Call that and look.

  • We get our linear regression summary.
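As a minimal sketch of the `lm` and `summary` calls just described: the dependent variable goes on the left of the tilde and the independent variable on the right. The data are simulated, and the names `peds`, `Age`, `Height`, and `linear_regression` are assumptions for illustration rather than what was on screen.

```r
# Simulated stand-in for the pediatric data (not the real NHANES values).
set.seed(50)
age <- sample(0:18, 300, replace = TRUE)
peds <- data.frame(Age = age, Height = 75 + 5.5 * age + rnorm(300, sd = 8))

# Dependent variable (Height) on the left of the tilde,
# independent variable (Age) on the right.
linear_regression <- lm(Height ~ Age, data = peds)

# Prints the coefficient table with estimates, standard errors, and p values.
print(summary(linear_regression))
```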

  • So there are multiple p values here, but I could go into a little bit

  • more depth perhaps next time, but essentially what's happening,

  • is there are actually two things that come out of a linear regression.

  • Remember it's y equals mx + b because it's a line.

  • There are two variables here--

  • or there are two things that we could potentially plot.

  • m, which is the intercept.

  • Oh, sorry! b which is the intercept.

  • And m, which is the slope.

  • So linear regressions are very useful, because you

  • want to say, hey, as age goes up by x, how much does y go up?

  • Or in this case, if age increases by 4.25 years--

  • as age increases by 1, height increases by 4.25 centimeters.

  • And that has a p value of less than--

  • a tiny, tiny p value of 2 times 10 to the negative 16.

  • So that's way below pretty much any alpha

  • which tells you that there is 100--

  • there is almost-- it's statistically almost impossible for there

  • not to be an actual significance here.

  • COLTON OGDEN: Right.

  • That makes sense.

  • ANDY CHEN: So there's also the intercept,

  • which is sort of like-- remember it's a line, and it doesn't make a lot of sense

  • in the real world, but it's saying, if your age is 0,

  • your height is going to be 1.43 centimeters.

  • Doesn't make a lot of sense in terms of being an actual baby.

  • Right?

  • But that's just how a regression works,

  • so you're fitting a line to real data.
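To make the y = mx + b reading concrete, here is a small sketch that pulls the slope and intercept out of a fitted model and checks what they mean. The data are simulated, so the numbers will not match the 4.25 and 1.43 quoted on stream; `peds`, `Age`, `Height`, and `fit` are hypothetical names.

```r
# Simulated stand-in for the pediatric data (not the real NHANES values).
set.seed(50)
age <- sample(0:18, 300, replace = TRUE)
peds <- data.frame(Age = age, Height = 75 + 5.5 * age + rnorm(300, sd = 8))

fit <- lm(Height ~ Age, data = peds)

coefs     <- coef(fit)
intercept <- coefs[["(Intercept)"]]  # b: predicted height when age is 0
slope     <- coefs[["Age"]]          # m: change in height per extra year

# The prediction at age 0 is exactly the intercept.
at_zero <- predict(fit, newdata = data.frame(Age = 0))

# Two predictions one year apart differ by exactly the slope.
one_year_apart <- predict(fit, newdata = data.frame(Age = c(10, 11)))
step <- diff(one_year_apart)

# The p values for the intercept and the slope, as shown by summary().
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
print(c(intercept = intercept, slope = slope))
print(p_values)
```

The intercept is just where the fitted line crosses age 0, which is why it need not describe a realistic newborn.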

  • So cool.

  • I think that kind of finishes

  • off our conversation for today.

  • COLTON OGDEN: Yeah.

  • That was pretty cool.

  • Getting to see the fact that you get all these nice graphical tools as well.

  • Not only just being able to model the data and

  • see the variables that matter to you, but also model them visually.

  • I like seeing things visually.

  • I think that's important.

  • [? Fatmo's ?] saying, "I think all of CS50 is awesome, they train well.

  • Andy being a star among them."

  • ANDY CHEN: Ah, thank you so much.

  • COLTON OGDEN: "Colton's [INAUDIBLE] virtual office hours is unique."

  • It's a fun time.

  • I'm glad everybody's joining in and having fun.

  • This is my first exposure to R. So thank you very

  • much for coming on to the show and--

  • ANDY CHEN: Yeah.

  • Absolutely.

  • COLTON OGDEN: --educating us on R and RStudio.

  • This is pretty cool.

  • ANDY CHEN: Yeah.

  • I needed to refresh myself.

  • It's been a while.

  • COLTON OGDEN: Yeah.

  • Yeah.

  • I know, I can imagine.

  • Some of the function calls get pretty bulky there.

  • ANDY CHEN: Yeah I don't think I could've done this without notes.

  • COLTON OGDEN: Oh, yeah.

  • It'd be tough.

  • I can imagine it being tough.

  • Yeah, this is great.

  • Thank you so much.

  • And then everybody who wants to follow along, or join

  • after the fact, R and RStudio are free.

  • So as we talked about earlier in the chat, definitely-- early in the stream,

  • and in the chat, definitely grab those, and mess around.

  • Hopefully the sandbox is up and running with R functionality,

  • as well, in the near future.

  • This is back to the firewall screen.

  • If anybody has any last questions before we wrap up the stream,

  • definitely let us know.

  • We'll stick around for just a couple more minutes.

  • This week is the CS50 hackathon.

  • So tomorrow is the hackathon, Andy and I will be there.

  • Going into Friday, because it's an all nighter.

  • So we start at night and then it goes until, I would say 5:00 in the morning,

  • or 6:00 in the morning.

  • ANDY CHEN: 5:00 in the morning.

  • COLTON OGDEN: We go to IHOP.

  • Yeah.

  • That'll be great.

  • Is it your first or second hackathon?

  • ANDY CHEN: This is my first hackathon.

  • COLTON OGDEN: OK.

  • Nice.

  • ANDY CHEN: When I took this class, I took it in Kenya.

  • COLTON OGDEN: Oh, OK.

  • Yeah, you would have had to vicariously sort of be a part of the hackathon.

  • ANDY CHEN: I saw the--

  • I liked the videos.

  • COLTON OGDEN: Yeah, no, we have the hackathon.

  • So we won't be streaming this week.

  • We will stream next week.

  • So we have a stream with Nick on Tuesday.

  • And then I'll probably do a stream on Wednesday, is probably actually the day

  • that I'll do a stream.

  • We'll finish up Space Invaders.

  • And I think Monday, we have another stream lined up.

  • I need to check my calendar just to 100% verify

  • the stream schedule for next week.

  • But I do believe that, that is what we have set up.

  • ANDY CHEN: Nice.

  • COLTON OGDEN: Ba, ba, ba, ba, ba.

  • ANDY CHEN: Ba, ba, ba, ba.

  • COLTON OGDEN: Yeah.

  • So next week we have David and I on Monday for the surprise.

  • So I'm not going to spoil that.

  • And then on Tuesday, Nick will be joining us for a C basics tutorial.

  • And then Wednesday, I will be--

  • we'll finish up Space Invaders.

  • So yeah, that will be a great time.

  • [INAUDIBLE] saying, "This is really cool.

  • Hope to catch more of these.

  • Stats can be boring, but you make it fun."

  • ANDY CHEN: Ah, thank you so much.

  • If you call it--

  • here's a trick.

  • If you call it, data science, people think it's really awesome.

  • COLTON OGDEN: Yeah.

  • There you go.

  • And they pay more right?

  • ANDY CHEN: They pay you way more.

  • But it's sort of the same thing.

  • COLTON OGDEN: says, "Thank you Andy, and Colton, very interesting stream."

  • Yeah, thanks for tuning in.

  • Thanks so much.

  • "I was going to opt out today, but [? Babbick Night ?] efforts to make it

  • made me stay tuned."

  • Yeah, so thanks everybody, helping each other stay tuned in.

  • "Love the surprises," says [? Azley. ?] Yeah, me too.

  • This will be a great stream.

  • So yeah, thanks again, Andy.

  • ANDY CHEN: Yeah, of course.

  • COLTON OGDEN: We'll have you again, probably in the spring at some point,

  • or [? J term ?] time--

  • ANDY CHEN: Sure.

  • Yeah.

  • I'll be around.

  • COLTON OGDEN: --to follow up stream on something,

  • whether it's stats related or otherwise.

  • Cybersecurity and data science, let's catch followers.

  • Says [? Fatmo, ?] "Is there any first steps class

  • you suggest, or is there any intro class that

  • has a professor with good reviews?"

  • ANDY CHEN: At Harvard?

  • Or--

  • COLTON OGDEN: Maybe just any.

  • ANDY CHEN: So edX, HarvardX has an offering called Stat110x,

  • [INAUDIBLE] because I worked on it.

  • COLTON OGDEN: There you go.

  • ANDY CHEN: It's pretty intense.

  • It's an introduction to probability theory, which actually

  • isn't what we're doing today, but it's the background for why

  • what we're doing today works.

  • If that's what you're interested in.

  • COLTON OGDEN: More rigorous?

  • ANDY CHEN: Yeah.

  • It's very difficult.

  • COLTON OGDEN: You'll have to check that out.

  • What was the name of it again?

  • ANDY CHEN: Stat 110x on edX.

  • It's probably offered by Harvard Stat 110x.

  • COLTON OGDEN: OK.

  • ANDY CHEN: Professor is Joe Blitzstein.

  • COLTON OGDEN: Check it out.

  • ANDY CHEN: Yeah.

  • COLTON OGDEN: You learn some stats.

  • Actually, [? dive ?] into a little bit at some point.

  • ANDY CHEN: Yeah.

  • I forget all of it.

  • COLTON OGDEN: Maybe some R. Some Python data science stuff sounds pretty cool.

  • ANDY CHEN: Name some Rthon.

  • Is that a thing?

  • Rthon.

  • COLTON OGDEN: Rthon?

  • Maybe, I'm not sure.

  • Integrate RStudio with Python.

  • "Thanks for the stream.

  • I want to ask the logic behind the p value,

  • but I know it's the end of the stream so it's cool

  • if you don't have time to explain it."

  • ANDY CHEN: The logic behind the p value is

  • we assume that everything that we sample, all events,

  • happen along a distribution.

  • In this instance, we assume it's a normal distribution.

  • And your p value is saying, what is the likelihood

  • that the actual result we get is--

  • so if you think about the normal distribution.

  • When we think about the area under the curve as the possible region where

  • conceivably, that event could happen.

  • The likelihood of that happening--

  • there is a very small area on both tails that

  • is extremely unlikely, that covers 5% of that area,

  • and that's actually about two standard deviations from the mean.

  • And we're saying that we're OK, essentially, if our sample--

  • if our results actually were in that area.

  • It's saying like, the likelihood of it being in this area

  • is so far gone, it's 5%, that we're OK with it.

  • But what that also means is, 1 out of every 20 times

  • you do it you're going to get not a great result. But yeah.
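The two claims in this answer can be checked numerically with base R. This is a rough sketch, not part of the stream: it assumes a standard normal distribution, and the simulation (two samples of 30 drawn from the same distribution, repeated 2000 times) is an illustrative setup chosen here.

```r
# Area in both tails of a standard normal beyond 1.96 standard deviations.
tail_area <- 2 * pnorm(-1.96)
print(tail_area)      # about 0.05

# The cutoff that leaves 2.5% in each tail, roughly "two standard deviations".
cutoff <- qnorm(0.975)
print(cutoff)         # about 1.96

# Simulation: both samples come from the SAME distribution, so the null
# hypothesis is true every time. Roughly 1 test in 20 still gives p < 0.05.
set.seed(50)
p_values <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

The simulated rate lands near 0.05, which is the "1 out of every 20 times" mentioned above.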

  • COLTON OGDEN: And tune in for the stats course

  • for probably more of the background on that, I'm guessing, right?

  • ANDY CHEN: Yeah.

  • COLTON OGDEN: Someone asked, JumpJump123, "Could you

  • say the course name again?"

  • ANDY CHEN: That would be Stat 110x.

  • COLTON OGDEN: And then, [? TwitchHelloWorld, ?] aka,

  • Jacques or Jack, says, "What did you say is the stream Professor Malan is

  • doing Monday?"

  • And that is a surprise.

  • We're not going to spoil that one.

  • Another surprise, you're going to have to tune in.

  • Cool.

  • ANDY CHEN: Nice.

  • COLTON OGDEN: I think that was it.

  • We're going to adjourn now.

  • It's been a little over two hours.

  • Thanks again, Andy--

  • ANDY CHEN: Yeah.

  • Of course.

  • COLTON OGDEN: --for coming on today.

  • ANDY CHEN: Thanks for having me.

  • COLTON OGDEN: It was great.

  • It was great fun.

  • Great time seeing R. I've always heard about it.

  • Never actually seen it too much in practice.

  • ANDY CHEN: [INAUDIBLE].

  • COLTON OGDEN: Physically seen it.

  • But it was a great exposure for me.

  • Have a great rest of the week, all.

  • I think we'll probably post videos of hackathon related activities,

  • and pictures, and whatnot on the Facebook group.

  • And I will see all of you next Monday.

  • ANDY CHEN: That's right.

  • Have a great weekend in the meantime.

  • COLTON OGDEN: All right, everybody, on that note,

  • let's have a great rest of the week and weekend.

  • See you next time.

  • ANDY CHEN: Ciao.
