COLTON OGDEN: Hello world. This is CS50 on Twitch. My name is Colton Ogden, and today I am joined for the first time by-- ANDY CHEN: Andy Chen. Nice to meet y'all. COLTON OGDEN: So Andy here-- tell us a little about what you do here on campus, what you're involved in. ANDY CHEN: Sure. So I am a master's student studying bioinformatics. I'm also a special student in computer science, and I actually work at HarvardX, so if you guys are familiar with the online learning platforms of Harvard, that's one of the offices that has really good resources. COLTON OGDEN: I feel like you-- didn't I-- I met you in the spring, I think. You came to the fair for the Supreme Court. ANDY CHEN: That's right, yeah. COLTON OGDEN: And I think you were talking about something like that, yeah. Pretty exciting. And what are you going to talk about today? ANDY CHEN: Well today we're going to talk about a programming language called R, and one of the things you can do in it, which is biostatistics. COLTON OGDEN: Oh yeah. ANDY CHEN: Cole, you might ask me what is biostatistics? COLTON OGDEN: What is biostatistics? I actually-- ANDY CHEN: So it's really statistics in the field of like biological data. But a lot of people use it in the context of epidemiology, as opposed to more like molecular biology kind of things. And that's actually what we're going to be dealing with today. COLTON OGDEN: That's diagnosing diseases, right? Epidemiology. ANDY CHEN: Epidemiology is sort of the study and the practice of response to the spread of diseases. COLTON OGDEN: Got it, OK. Makes sense. ANDY CHEN: Right. COLTON OGDEN: I clearly don't-- I'm not an expert on biology or biostats. ANDY CHEN: But you will be soon. COLTON OGDEN: Yeah, I'm very excited. We have a lot of people in the chat that have joined us, and we were talking before we started, a little bit in advance. Thank you very much to everybody who's joined. Regulars of ISO TV.
There's a new regular, Asley, Newanda33333, belacures, m.kloppenburg, thank you for joining. Let me make sure I didn't miss anybody up above that. Techytack, hello. [INAUDIBLE] and fatma, thank you for joining, the regulars and everyone else saying hi. "Really curious about this one," says m.kloppenburg. This is the first time we've had anything kind of statistics related on stream. ANDY CHEN: Oh, exciting. COLTON OGDEN: Python's obviously a language that's very often used in bio or in stats, generally speaking. But R is kind of like the language that people I think, maybe most people associate-- or at least they associate stats with R, and then also sort of stats with Python, too. I don't know anything about R, so I'm actually very curious to see what it looks like, what the environment looks like, what we can do in it. I think we've caught up on all the comments. Everybody's saying hey Andy, nice to meet you Andy, so you've got a lot of friends in the chat there. Yeah, so thanks so much everybody. Let's go to your screen here, so we have your screen set up. And why don't you get us started here. ANDY CHEN: Sure, awesome. Thank you very much Colton. Hello everyone, hello friends from all over the world. So R, like Colton was saying, is one of probably two languages that are very popular for statistics or data science kind of things, Python being the other one. Today we're going to be looking at R, so let's go to the website. So bring up a browser if you will. The first thing we're going to be doing is installing the language itself. Now notice that we actually are not going to be working in R directly, which on Mac OS X-- well actually I don't know what I'm running, but whatever-- on Mac you have to install R the language itself, which actually I think does have a command line interface. But we're going to be working in RStudio, which is an integrated developing-- developing environment? COLTON OGDEN: Integrated development environment.
ANDY CHEN: Development environment, thank you. COLTON OGDEN: It's a mouthful. ANDY CHEN: It's a mouthful. I'll just keep calling it IDE. COLTON OGDEN: IDE. That's why we call it an IDE. No one wants to say all those words. ANDY CHEN: But yeah, so we're going to be installing R, which is the language itself, as well as RStudio, which is the IDE in which we'll be working. COLTON OGDEN: What are the links that we can go to? I can toss them in the chat as well. ANDY CHEN: Great. So the first one is going to be www.r-project.org. COLTON OGDEN: OK. ANDY CHEN: The second one is going to be rstudio.com. And the last one-- COLTON OGDEN: The former being the language itself, the latter being the IDE, the R IDE that you're alluding to? ANDY CHEN: Exactly. COLTON OGDEN: OK. And babicnight also in the chat, and Andre Jacob Johnson, and Irenae, thank you very much for joining us, everybody. Well, some more regulars. And babic, to answer your question, not late at all. We just started. We're now tossing some links into the chat for downloading R and RStudio. So r-project.org and rstudio.com. ANDY CHEN: Thank you. So what we're going to be doing today is working with the NHANES data set, which is actually kind of difficult to-- it's freely available. It's US governmental data, but it's actually hard to parse in its raw format, so what we have provided today is a .text file, which I've uploaded to this link. I don't-- COLTON OGDEN: We can make a bitly for it. So what is the-- do you want to email me the link, and then I'll toss a bitly into the chat. People can click on it, and then get access to it later on YouTube. ANDY CHEN: Absolutely, yep. And then so while I'm doing that, let's see. Let's get to email. Oh, man [INAUDIBLE] piling up. COLTON OGDEN: If you want I can go and go here, so people can see your personal email. ANDY CHEN: Oh, yeah. Some people like that. COLTON OGDEN: Lots of juicy tidbits in there. Everybody just go ahead and look through Andy's email.
Yeah, we'll get to it. We'll get a bitly for everybody in the chat, if you want to email that to me. ANDY CHEN: I think it should be sent. COLTON OGDEN: OK, and we refresh. Just one sec, everybody. Sorry for the delay, but this will be a lot better than typing a super long mega upload link. OK, so here I go, I got the link. I'm making the link. I'm going to copy it to bitly. If it wants to cooperate. Copy it over to bitly, paste it in there. Get rid of these stupid messages. ANDY CHEN: They're just trying to show you love, Cole. COLTON OGDEN: A little bit. Aw, crap. OK. Here, we're good. We got this. And edit bitly, get a copy. Can I customize it? I can. So we're going to call this bit.ly/biostats_stream. And so that will be how it works. Clear all these messages, save that. It's going to-- I'm going to copy that. I'm going to go to the chat, and I'm going to paste that in. So now you can go to this bitly URL. So it's bit.ly/biostats_stream. And let me-- and Asly says, Andy is such a Hufflepuff, with a heart emoji. ANDY CHEN: Thank you. COLTON OGDEN: So bit.ly-- let me make sure this is working-- /biostats_stream. Yep, it works perfectly. ANDY CHEN: Good. And if anybody is curious, go to bit.ly if you have a really long URL, and you want to shorten it down, you can do that at bit.ly. Bitly, as it's called. COLTON OGDEN: Indeed. OK. ANDY CHEN: I think we have a few comments. Nuwanda333, I am a Hufflepuff. Thank you, I think. I think it's a good thing, right? COLTON OGDEN: I think so. I think the Hufflepuffs are-- I actually don't know what the-- ANDY CHEN: I think they're the catch all. COLTON OGDEN: I think they're the friendly-- like I actually honestly don't know too much about it. ANDY CHEN: OK, well I'll take it. COLTON OGDEN: All I know is that they're friendly. ANDY CHEN: Sure. And then in response to TwitchHelloWorld. So I am doing my master's in bioinformatics. COLTON OGDEN: And did you say you're doing something in CS?
ANDY CHEN: I'm a special student in the Graduate School of Arts and Sciences, which means I am sort of a visiting student within the university. COLTON OGDEN: Mm. OK got it, got it. Makes sense. Only three megabytes, says faceless voice [INAUDIBLE] surprised. The data set is relatively small. ANDY CHEN: Yay. It's just text. COLTON OGDEN: No big data today. This is small data. Sort of. ANDY CHEN: It's approximately 10,000 entries, if I recall. COLTON OGDEN: That's actually pretty sizable. ANDY CHEN: Actually I think it's exactly 10,000 entries. COLTON OGDEN: Oh, wow. Next we'll find the number. ANDY CHEN: We'll find out. COLTON OGDEN: Forces me to install a stupid Chrome add-on. ANDY CHEN: Oh, don't do that. It's so-- actually let me see if I can show what it should look like. Do not install the Chrome add-on. It's-- here. So can we-- OK great. So if you hit it once, and it should, da, da, da-- I don't know what's going on. No, no, you don't want this. Do not do this. This is-- OK. I think-- there we go. So you want it actually-- COLTON OGDEN: I guess you've got to click on it twice. Just don't install the add-on. The upload is a little bit shady. ANDY CHEN: It's a little-- I wouldn't trust it. Those kiwis-- no, I'm just kidding. COLTON OGDEN: Trying to get some ransomware on everyone's computer today. Make a little extra money on the side. ANDY CHEN: That's actually what they pay me for. It's my real job. But all right, let's get back to R. The first link, on r-project.org, is for downloading the language itself. All right, so what we're going to do is under the Getting Started section we're going to go to Download R, this link right here. We're going to click it. And then, so these are-- COLTON OGDEN: Maybe we want to Command-plus a couple of times, just so we can see a little bit. It's a little small. ANDY CHEN: Let me see if I can-- yeah. Is that better? COLTON OGDEN: Yeah, this should be pretty good I think. ANDY CHEN: Cool.
So CRAN is the Comprehensive R Archive Network, which is a bunch of mirrors where the different distributions of R are stored. And so I'm going to go to one that is closest to me. It doesn't really matter, but what, Massachusetts. Probably CMU. Pennsylvania, I think that's the closest one. Well, if we go back. CMU Pittsburgh. Pennsylvania is pretty close. COLTON OGDEN: It's probably, it's pretty close. ANDY CHEN: I mean it doesn't really matter. COLTON OGDEN: It doesn't matter too much. If you're-- maybe if you're abroad it might make a little bit more of a difference, a download speed difference. But yeah, choose the mirror most appropriate to you, to your country. We do have a lot of people tuning in from all over the world, which is always super awesome. [? RobertSpiri ?] thank you for joining. We are doing some biostats in R with Andy Chen. ANDY CHEN: Hello. COLTON OGDEN: A newcomer to the stream. And we just got everybody sort of situated with the data set. So if you're not equipped with it yet, you can go to this URL, bit.ly/biostats_stream. ANDY CHEN: That's a great title. COLTON OGDEN: It works. It wasn't taken, thankfully. No one's done a biostats stream. ANDY CHEN: This is the first ever. COLTON OGDEN: First ever in history. ANDY CHEN: Wow, I'm into it. I feel like I should like frame this moment, put it over my bed. COLTON OGDEN: I think you should. ANDY CHEN: But let's get back to R before I get too distracted. So once we've clicked on an appropriate mirror for you, I am going to download R for Mac OS, because that is the operating system I'm running. And then it's-- I think 3.5.1 should work, but we'll go with an older version, just to be safe. I'm going to do 3.3.3. To download that. So on Mac OS that downloads a .pkg, which I think actually is like sort of a custom installer. So it's currently downloading, and then once that's done downloading, I'm going to click it. I'm going to double click it. There we go. COLTON OGDEN: Go through all the steps.
ANDY CHEN: And then just-- if you want to be particular about it you should feel welcome to. It-- oh, so one thing that is actually very attractive about R in the industry is it is a commercially-- it's free, open source software. So unlike a lot of commercial statistics software-- some industries that do a lot of statistics will prefer certain languages like Stata or SAS, but R is popular in certain industries, and in academia, because it's free. COLTON OGDEN: Makes sense. And hence maybe why it's become so popular as of the last few years. ANDY CHEN: Absolutely. COLTON OGDEN: Facelessvoice in the chat is saying, what is biostats? ANDY CHEN: Yeah, that's a really good question. So the way that I'm using the term today is referring to applied statistics with biological data. And we talked about this earlier in the stream, that could technically be used in the context of molecular biological data, but today we're actually going to be looking at epidemiological data. Which to reiterate, is the study and the practice of response to disease and its transmission. COLTON OGDEN: OK. Makes sense. Aren't the majority of languages free, says facelessvoice? ANDY CHEN: So Stata and SAS are not free, I don't think. COLTON OGDEN: Like Matlab I think has an expensive license. ANDY CHEN: I think so, yeah. COLTON OGDEN: Most languages are free, I would probably say. I think it's also the environment that is usually a big part of it, too. ANDY CHEN: That's true. The environment. COLTON OGDEN: But in the context of biostats it sounds like-- in the context of statistics it sounds like there are, relative to maybe the rest of CS, some languages or environments that are not free. ANDY CHEN: That's-- yeah. COLTON OGDEN: That's kind of why making the case here is important. ANDY CHEN: I think that's absolutely true. And as we'll see in a little bit, I-- this is still installing. Great. We have time to talk. Ooh, no we don't.
Statistics is sort of-- the way that you can use R is almost like a giant calculator. Which is different from certain programming languages, which is why it's very popular in certain-- in industry and statistics, because you can just plug it in and plug it out without having to think about like scripts or scary CS kind of things. COLTON OGDEN: Makes sense. [? Ahmet Osman ?] says, this is an awesome stream. By chance I'm participating in MIT Hacking Medicine. Do you have any advice or recommendations? ANDY CHEN: I actually-- I think I was thinking of applying to that, but I missed the deadline. The last MIT hackathon I went to was a VR/AR, AR/VR hackathon? Advice. Are you local? It would be helpful to know if-- do you have any specific questions on what kind of advice you would like? Because if not, my first thing is just have a lot of fun and make a lot of friends. And get swag. COLTON OGDEN: Also shout out to the invisible can of seltzer. ANDY CHEN: It's the color of my face. COLTON OGDEN: You can put it in front of your-- actually [INAUDIBLE] because all it does, it shows the background of the grass. Because you're already looking at a green background, but if that was for example like a red background, it would probably be a little bit easier to see that it's invisible. [? Twitch Hello World, ?] is it also applicable to epidemiology in terms of treatment and not just spread? ANDY CHEN: To epidemiology in terms of treatment and not just spread. Hm. I suppose you could probably get some interesting data out of-- well, actually I'll say it this way. I think a lot of epidemiologists will say that the treatment requires understanding the situation, understanding the context, and the only way to do that, or one of the best tools we have to do that, is through statistical analysis. And so treatment-- in the real world epidemiologists have to face sort of issues of, this might be the best medicine, but is it cost effective? Can we get it there in time?
There's a lot of logistics involved. And if you have statistical data about how the disease is spreading, where it's spreading, and what actual demographics are being affected, you can make really good logistical and business decisions that will maximize the medical impact that you do have. And so in terms of treatment, not in the development of treatment, but absolutely in the execution and the decision making of what treatment is probably best. COLTON OGDEN: Makes sense. Biostats is equal to epidemiology plus statistics, definition from [? Vert ?] [? Lu's ?] school. ANDY CHEN: I-- you know what? I like that definition, yeah. I think all definitions need to be a little fluid, because people use them in different ways in different contexts, and they evolve over time. But I like that definition. COLTON OGDEN: Andre asks, what would you say are the biggest advantages of R over Python? ANDY CHEN: Hmm. I actually am much, much more into Python than I am into R. But the biggest advantage of R over Python? Oh, piping. You can do lots of really cool things with piping, which is like sort of feeding the output of one process into the next. It's kind of hard to explain, but I think we'll talk about it later. And the other thing is, I think R is a little more approachable to people who are-- Python is a little more-- it's actually less abstract, but it's a little more similar to traditional computer science-- like programming languages and environments-- to the point where I think a lot of people are uncomfortable getting into it. It's like, oh, computer science. Whereas R is very-- the GUI is very much, as we'll see-- the GUI for RStudio is a pretty familiar environment to work in. You can just use it like a calculator. COLTON OGDEN: That makes sense. Great. ANDY CHEN: So let's-- it downloaded. Oh, R is installed. So if I check R, it's installed. Great. So, well actually let's open that. So R itself does have a command line environment if you want to work in it by itself. But I don't know where it went.
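For anyone following along, here is a minimal sketch of the two ideas Andy mentions: using R like a calculator, and piping. It assumes R 4.1 or later, where the native `|>` pipe is available (on the 3.x versions installed in the stream, piping would typically come from an add-on package instead). The `heights_cm` values are made up purely for illustration.

```r
# R as a calculator: type an expression at the console and it is evaluated
2 + 2
sqrt(16)

# A small made-up vector (hypothetical values, not from any data set)
heights_cm <- c(160, 172, 181)
mean(heights_cm)

# Piping: the value on the left is fed in as the first argument of the
# function on the right. Native pipe, assumes R >= 4.1.
rounded_mean <- heights_cm |> mean() |> round(1)
rounded_mean

# The piped line is just another way of writing the nested call
nested_mean <- round(mean(heights_cm), 1)
```

The piped form reads left to right in the order the steps happen, which is the main appeal when a chain of steps gets long.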
COLTON OGDEN: I think it came up and then instantly-- ANDY CHEN: It died. COLTON OGDEN: Yeah. ANDY CHEN: Uh-oh. I hope that's not-- that's not even working. OK well, we'll get to that when we get to that. Let's go to RStudio. All right, so again this is the integrated development environment in which we'll be working with the R language, and to install RStudio we're going to go to this link, which is again rstudio.com, and we're going to choose Download under RStudio. So this is RStudio right here, and we're going to choose this link here, which is Download. All right, and we're going to choose the RStudio Desktop open source license, and as Colton was saying earlier, some of the IDEs and certain languages do cost money to use, and RStudio is a common-- is a popular one because it's also free, sort of. Depending on your usage, as you'll see here. COLTON OGDEN: Yeah, it looks like they do have different licenses, different commercial licenses and whatnot. They get pretty expensive. ANDY CHEN: Yeah, $30,000 a year. COLTON OGDEN: That is expensive. ANDY CHEN: That's about how much I drop on my boats every month. I wish. COLTON OGDEN: Making that sweet stats money. ANDY CHEN: Although if you call it data science, slap data science on there, you make a lot of money that way. COLTON OGDEN: Yeah, pretty much. ANDY CHEN: All right, so it brings you down to here, and so again I'm running Mac OS. So I'm going to download that. And then this is more of a traditional-- at least I don't know how the distribution is for the other operating systems, but this is more of a traditional Mac type installer. So it's just like, it comes up, and then you drag and drop into your Applications folder, and then it mounts an image I think. Once it's done downloading. Great. COLTON OGDEN: Yeah, looks good. The DMG? ANDY CHEN: Yeah, DMG, exactly.
COLTON OGDEN: If you're on Windows, there would probably be something similar-- they'll probably have an installer, an MSI that adds an application to some Program Files folder. ANDY CHEN: So that's in there. COLTON OGDEN: But then they make it easy, is what it looks like. ANDY CHEN: Right, yes. COLTON OGDEN: Relatively. ANDY CHEN: Oh, I guess I'll check some other way. Ooh, my secrets. COLTON OGDEN: Well, your text messages. ANDY CHEN: Uh-oh. COLTON OGDEN: A deep dive in your text history here. ANDY CHEN: Ruh-ro. COLTON OGDEN: OK, [? Ahmet Osman ?] says I'm Egyptian living in Saudi, and they're coming here and I just got the invitation. Actually today after 12 hours going to start. The advice I'd be interested in is, how could the stream benefit me in the context of health care business. I'm also an MIT Enterprise Forum competitor for 2014, and hell yeah it was a lot of fun. ANDY CHEN: Nice. OK, let's see. [INAUDIBLE] I got the invitation [INAUDIBLE] 12 hours. Oh OK, so it's coming soon. What would be [INAUDIBLE] how this [INAUDIBLE] in context of health care business. This is Hacking Medicine, right? I think unless you have a very specific niche in mind, a topic or a field that you want to go into in health business, probably the single best thing that you can get out of a hackathon like this is just the network. Right? Spend as much time as you can-- well obviously focus on your project, whatever your hack is. But meet cool people. People come there because they're smart, they're passionate, they're driven. And so there are very few opportunities in life where you can, in like what, 24 hours, meet potentially a hundred people who are really interesting, really smart. And if they don't have specifically what you need, they might later on, or maybe they know someone who does.
But unless you-- the other thing is, if you have something you are very specific in, I would look up the list of sponsors, list of speakers, and try to be very strategic about the resources that are available and the people that are there that you can talk to, to try to get into whatever part of the health care business you want to get into. COLTON OGDEN: Great. Good response. Jack [INAUDIBLE] saying it's an executable for Windows, but the download froze at 100 percent. That is unfortunate. I would maybe just try it again, probably. I've had that happen to me on Windows a couple of times. And Chrome. Chrome will say that the download is going, and then when it gets to 100 percent it will just kind of chill for a while. But yeah. That would-- for me as well. Let it be. Sometimes you have to wait it out a little bit, and then it'll save the larger files and Chrome into a-- yeah exactly. [? Babbick Night ?] just said it will complete, and then you can install. ANDY CHEN: (SINGING) Let it be, let it be. COLTON OGDEN: Just don't divorce too many ex-wives, that gets expensive, says [? Twitch Hello World. ?] ANDY CHEN: Oh, nice. That's good advice, keep it in mind. All right, so speaking of [INAUDIBLE] a few minutes while people catch up, I think. It's-- so I can open this up. So this is what RStudio looks like. COLTON OGDEN: Does it get any bigger by chance? It is a little small. Might be small on the screen. ANDY CHEN: Yeah, I can do that. COLTON OGDEN: Beautiful. ANDY CHEN: Well actually that's real nice. COLTON OGDEN: That's great. ANDY CHEN: Yeah. So RStudio has-- let me actually-- I don't like this full screen. COLTON OGDEN: You need Option plus-- it'll expand without actually going to full screen mode. Or Option, click the plus. ANDY CHEN: Way out here? COLTON OGDEN: The green plus, yeah. Hold Option and click that green plus. ANDY CHEN: Oh. COLTON OGDEN: No, this green one up here. No, yeah, that one.
ANDY CHEN: Thank you, that's a good-- that's a good hack to know. COLTON OGDEN: I had the same issue with another application. I forgot what it was, and I wasn't having any of the full screen. Oh, and I forgot to shout out all of the people that followed now and before the stream. Let's do that really fast. So we have Notice, and actually it looks like [? Digleen, ?] [INAUDIBLE] [? Conciliated, ?] [? Stadium'91 ?] you [INAUDIBLE]. We have Alaska Ukraine, Newtown kings, savage X factor. Tono A 30, and Kate@00. Thank all of you for following today. ANDY CHEN: Yeah, thanks for coming out. COLTON OGDEN: Quite a good number of people. ANDY CHEN: If I had Thanksgiving leftovers, I would share them, but I ate them all. COLTON OGDEN: Oh, yeah. ANDY CHEN: In my tummy. COLTON OGDEN: You did the right thing. What do we see, we have blackjack counting cards. ANDY CHEN: I don't know what that is. COLTON OGDEN: Steve [INAUDIBLE] sent a blackjack counting cards link and, see, it looks like high-low. High-low being like the classic phrase. Yeah. Cool. ANDY CHEN: Nice. Bye, Robert Springer. COLTON OGDEN: Oh, Robert Springer. Gotta run. Keep up the good work. Thanks Rob for tuning in. Hopefully catch it on YouTube. We'll see you next time. And good, it sounds like Jack Welch got the download working after all. Yeah, sometimes Chrome is weird like that, or whatever browser on Windows. It just takes a couple seconds. ANDY CHEN: To finish up. COLTON OGDEN: Not sure why, but, you know. Such is life. ANDY CHEN: Great. OK, so now we've opened up RStudio, which is our integrated development environment in which we'll be working with R. Great, so let's familiarize ourselves a little bit with how it actually works. So you actually have your console here, and a terminal if you want to do some command line stuff. ls, cd. COLTON OGDEN: That's like your actual terminal for Mac, versus the console being like within RStudio. ANDY CHEN: Correct, yeah. I think it's-- yeah. It's wait.
I forget how command line works. Well anyways, it's there if you want to use it. COLTON OGDEN: Beautiful. ANDY CHEN: And then, so over here this is a sort of-- and I don't know how to describe this little panel here, but this is where all of your values, all your variables will be stored. Not stored, they will be displayed. So you can actually look at them. I'm going to get rid of that, because you're not supposed to be able to see that yet. The environment is empty. And so there's a little broom brush right here, and it will let you clear stuff. If for example I had some variables in here that I just cleared, which I just did. Let's say-- COLTON OGDEN: They've abstracted it into a brush? ANDY CHEN: Yeah, it's very-- COLTON OGDEN: Scoop all the garbage out there. ANDY CHEN: It's very high level. It's not machine code, it's human code. COLTON OGDEN: Yeah. ANDY CHEN: So let's say that I have a very messy console, right? It's messy, and I clean it. Oh look, it's clean. For those of you who like clean consoles, that's a very nifty-- COLTON OGDEN: Clean consoles are nice. ANDY CHEN: Clean consoles are very nice. COLTON OGDEN: Clear in the terminal as well, shout out to clear. ANDY CHEN: Oh, really? Just clear. Oh, dude, I'm getting all the tips, Colton. COLTON OGDEN: We had a Linux command stream. Nick? ANDY CHEN: Yeah, Nick. COLTON OGDEN: Right, yeah we had a lot of juicy tidbits in there. I don't know if clear was one of the things that-- does Python have something similar for the environment window? ANDY CHEN: It depends on what your IDE is. I usually work in Jupyter Notebook, which I think does actually. I just don't know what the command is in Jupyter. COLTON OGDEN: I don't work too much in those. I know PyCharm is-- people love PyCharm. I'm not sure if that has something similar as well? ANDY CHEN: Mm, I'm not sure. COLTON OGDEN: That's getting a lot of popularity in the Python community, if it hasn't already for a long time.
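The Environment pane and its broom icon have rough equivalents that can be typed at the R console. The sketch below is an illustration only: the variable names `x` and `greeting` are hypothetical, and it assumes the code is run at the top level of a fresh session, so that `ls()` refers to the same workspace the Environment pane displays.

```r
# Create a couple of variables; in RStudio these would appear
# in the Environment pane as they are assigned
x <- 42
greeting <- "hello"

# List the names of the objects currently in the workspace
print(ls())

# Rough console equivalent of clicking the broom in the Environment pane:
# remove every object that ls() reports
rm(list = ls())

# The workspace is now empty, so ls() returns a zero-length character vector
remaining <- ls()
print(remaining)
```

Note that `rm(list = ls())` only removes objects; it does not clear the text in the console. In RStudio the console itself can be cleared with Ctrl+L.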
ANDY CHEN: Right, it's-- I mean that seems like it's a pretty useful thing to be able to do. So I imagine a lot of popular IDEs have that available. Windows Defender scanner-- oh. COLTON OGDEN: Oh, yeah. Yeah, good point [? Jacob ?] about the-- or [INAUDIBLE] about the Windows Defender. That would make sense. It just wants to make sure you're not giving everybody viruses. ANDY CHEN: They got me. COLTON OGDEN: Step by step by step, every step of the way. [INAUDIBLE] I'm guessing PyCharm are the ones you talked about. ANDY CHEN: OK. Well actually I think there's a faceless-- no. There's a health interest in patients with learning disabilities and autism. Mm-hmm. That's not my field, but the fact that you have something narrow and specific, that means that you should definitely-- if that's something you want to be working in, look up the list of guests, the list of speakers, list of companies who are supporting, and also maybe if there's a list of other people who are joining, and then try to reach out to them as much as possible. And then brainstorm some ideas. COLTON OGDEN: And [INAUDIBLE], thank you very much for following. ANDY CHEN: Hi. All right so, this is RStudio, and the first thing we're going to be doing is importing the data set. So actually I should probably give a few words on the data set we're working on. COLTON OGDEN: Oh, yeah definitely describe what it is, and the full name of it too. ANDY CHEN: Absolutely. So NHANES is the National Health and Nutrition Examination Survey, which is a very long running-- and I think it's an annual survey-- that covers actually a lot of different things. But it's a CDC study that happens pretty much every year. There's NHANES I, II, and III, which cover different things, but if you explore the website-- let's see. About NHANES, blah, blah, blah, blah, blah.
You can have access to what the questions are, the way that the interviewers actually got this information from their patients, but so there are a lot of things that you can learn about. So anemia, cardiovascular disease, diabetes, environmental exposures, eye diseases, et cetera, et cetera. And yeah. So it's readily available on the CDC website, however the way to actually use the data-- so it's readily available, but the way to actually get into the data is a little bit confusing, which is why we're using a pre-rendered .text, nhanes.text file, because in R-- so the reason it's difficult to parse is, a lot of these data, these files, come out in SAS, which is the language we talked about earlier, in SAS format which we cannot use in R. So that's why we're using this dot text that we have in the mega upload link. COLTON OGDEN: Cool. Makes sense. ANDY CHEN: Sure so-- COLTON OGDEN: Do you have a specific field you're interested in applying to, Andy, says [? Twitch Hello World. ?] ANDY CHEN: Oh, hm. Yeah, that's a really good question. I think-- so the work that I support in the lab that I work in is regenerative biology. And so I suppose I would be interested in going into regenerative medicine as a field. COLTON OGDEN: Like stem cells? ANDY CHEN: Yeah. Exactly. It's part of the Harvard Stem Cell Initiative, or Institute. HSCI. A lot of I's. COLTON OGDEN: Get those limbs grown back. ANDY CHEN: Absolutely. If you guys are interested, you should look up axolotls, A-X-O-L-O-T-L. These are tiger salamanders and Ambystoma salamanders that, if you cut their arms off, they grow right back. And I think parts of their hearts and their tails and parts of the brain. COLTON OGDEN: Soon to be human DNA. ANDY CHEN: You heard it here. You heard it here first, folks. COLTON OGDEN: That would be pretty cool. ANDY CHEN: All right, so I think. Nice, bro, thanks. I'd definitely like to friend you. How could I get your contacts?
I think if you email one of us, or email Cole later on, he can probably put you in touch. Cool. And then [INAUDIBLE] says I opened it, now what. If you tell me-- I'm not sure what you opened, but if you give us a little more information we can probably try to help troubleshoot. COLTON OGDEN: Maybe RStudio. Maybe they're thinking-- ANDY CHEN: Oh, now what do we do. Oh, sorry, we got distracted by the comments. Let's finish up some of the comments, then we can get into R. Soon to be Wolverines. True. COLTON OGDEN: Yeah, that's true. Then we will-- that's the whole goal, the whole motivation. ANDY CHEN: That's the only reason I'm doing it. Let's be honest. Do you do brain regeneration? I've been helping stroke rehab. Oh, that's really awesome. So I don't personally study any brain regeneration, and actually most of the stuff I do is computational. But there is a lot happening in the field of brain regeneration in zebrafish and axolotls, so there are definitely a lot of faculty professors out there who are studying that. There was a [INAUDIBLE]. OK, cool. [? Heard ?] something on edX and I should have instead ted-- Oh, TED. Well, I gotta rep edX because they're right above us. OK, so back to the IDE. So this is what the IDE looks like when you open it. So now what we're going to do is we're going to open the NHANES data set, and the way-- there's actually two or three ways to do that, but in terms of actually interacting with the GUI there's two ways to do that. So the first one is, under this Environment pane right here there is the Import Dataset drop down menu. And so we're going to click that and then we're going to import a data set From Text, and then base is the second part of that. And so we're going to navigate to where our NHANES is, and so it should come up with an Import Dataset window that looks like this. And then what we're going to do is we're going to do Heading, we're going to check Yes just to make it pretty.
So if you notice, this is what it looks like if heading is yes: the actual headings are in the file itself. And here's what it looks like if it's no, in which case V1 is not nearly as apt as ID. So we're going to do that. And then we're going to import. COLTON OGDEN: Makes sense. ANDY CHEN: Yeah. So there you have it. We actually-- in this window that popped up above our console and terminal, we actually have the data itself visualized in the IDE. COLTON OGDEN: It's effectively sort of turned into Excel. ANDY CHEN: Essentially, yeah. It's a spreadsheet inside of your IDE, which is one of the reasons why R-- well, RStudio-- is popular: you can work on your statistics and have access to the spreadsheet that you're working with. Which I think you can probably do in certain Python IDEs, but most are not, I don't think, designed to do that. COLTON OGDEN: Makes sense. ANDY CHEN: So let's actually take a walk through the NHANES data set, and just check out what's inside. So ID is the ID of the patient. That's not super interesting to us, but-- so we have a survey year. So I think all of these might be 2009, 2010. Yeah, because NHANES is an annual survey, or roughly an annual survey. I think it's two years at a time, but I might be wrong. We also have gender. Well, probably a more apt term in common parlance today would be sex, the physical sex of an individual. Age, and then bracketed into age decade, which we'll talk about why they did that. We're going to talk a little bit about the kinds of variables that people use in statistics, because that determines what kind of statistical tests you will run on them when you're trying to figure things out from the analysis. Age months, race, education level, whether or not they're married, their income. So there's a lot of socioeconomics in it, too. Even like the number of rooms in your house. It's kind of an interesting data point to have. Whether they're working, their weights.
So socioeconomics, physical characteristics, BMI. So we're getting a little bit more towards some health characteristics, some disease statuses. Stati? Statuses? COLTON OGDEN: I think statuses is-- ANDY CHEN: Statuses? COLTON OGDEN: I think if we're doing official Latin it would be statiae, but I think that statuses. ANDY CHEN: Oh, is this supposed to be in Latin? I think I didn't get the memo. COLTON OGDEN: Yeah, the rest of it's going to have to be in Latin. Sorry. ANDY CHEN: Pulse, et cetera. There's actually a lot. So there are-- we'll see exactly-- well actually right here there are 10,000 entries, it says. But I'll also be showing you a line that will tell you how many rows there are. Blood pressure, systolic, diastolic. Testosterone levels, direct cholesterol, the volume of your urine, something I always want to know of course. Diabetes, the days with bad mental health, depression, number of pregnancies, babies, alcohol consumption. There's a lot. So what I'm trying to say is, NHANES is not comprehensive, but it is a very, very wide-breadth data set. If you're interested in learning about parsing this data to look for any trends that you want to learn about, it's a really good data set to start with. COLTON OGDEN: Yeah. Looks like there's a lot of fields in there. ANDY CHEN: A lot of fields-- marijuana, age of first marijuana. COLTON OGDEN: Definitely, having more information is more useful, obviously, than having less information. ANDY CHEN: Absolutely. That's what all data scientists will tell you. Hard drugs? Yes please! That's for, have you ever consumed hard drugs, I imagine. And then some sexual activity. And yeah. So that's all the different things we have here. So this is the NHANES data set. So that's one of the ways to import it. Let's clean this out. Let's say that we're starting over. COLTON OGDEN: Asa had a good question. She said the first three IDs are the same.
Was the same patient tested thrice on different occasions? ANDY CHEN: Ooh, that's a really good question. Well, let's open it again first. The other way to do it is File, Import Data Set from Text Base. Great. And again, we want to check heading. All right. That's a really good question. It looks like it's repeated. So if you look at actually all the other data, it is exactly the same. So like for example, the number of rooms is 6, 6, and 6. 9, not working. I think this-- yeah. So the first individual-- COLTON OGDEN: It's the same ID too. So maybe you would put this-- like, you would sort of make a set of the data where every idea is different. ANDY CHEN: Right. So that's probably really good idea. There appears to have been a replication error here. COLTON OGDEN: And asking it's actual people's data, right? L-o-l. ANDY CHEN: Yeah, no, this is publicly available. It's off the CDC website. COLTON OGDEN: I heard that China created genetically modified babies recently. ANDY CHEN: Yeah, like yesterday. They talked about like, two days ago, one of the scientists had someone CRISPR, which is a gene editing technique. One of the, I think, probably like, T cell receptors for an HIV virus out of a-- they CRISPRed it out of a baby's genome so that they don't have there-- So the way that HIV works is it-- the immunologists might get on me, but it's a virus that attacks one of the cells that's critical for it to the human immune system. My understanding of what happened with that China baby case is the father had HIV or AIDS. And then so the scientist CRISPRed, which is a gene editing [? talent, ?] CRISPRed the gene for one of the cell receptors out of the genome so that the baby's immune system cells can actually-- the HIV has no way of getting inside the cell. COLTON OGDEN: Interesting. They have to do that when the baby is like practically like, just after being a zygote. Because otherwise you'd have edit-- ANDY CHEN: A trillion, like, a billion cells. 
COLTON OGDEN: So they do it when it's just impregnate-- or the woman is just impregnated probably. ANDY CHEN: That's probably very, very early on. Actually, I have no idea how they did it in humans. I don't if it was in the woman herself, or if it was more like a test tube baby situation. COLTON OGDEN: Yeah, that would make sense. That would make sense. That would be easier. In vitro probably would be really difficult. ANDY CHEN: Yeah, it probably would be. Well, I don't know. I don't [INAUDIBLE]. COLTON OGDEN: Like, for the woman, I feel like it could be difficult to be constant operated on. If there was like, repeated follow up. ANDY CHEN: Right. Plus like, there was like, sort of health questions about-- I don't know how you would isolate the baby specifically inside of-- COLTON OGDEN: It would be rough. ANDY CHEN: That actually brings up a good point. That's one of the reasons why gene therapies are sort of questionable, is because like, they generally work on the scale of single cells. And so if you're trying to do provide gene therapy for an adult, that's a lot of cells that the retrovirus [INAUDIBLE].. COLTON OGDEN: If I were a scientist working in that, I have no obviously, information about that or context. But I would imagine it'd be easier just to start from the very beginning and get just the sperm and the egg, and then manipulate those cells. And then those cells would then replicate. And then the therapy that we've provided to the original cell would then propagate to the other cells. ANDY CHEN: Exactly, yeah. COLTON OGDEN: But I'm no expert. ANDY CHEN: Yeah. That's how that would work. It's much easier if you start earlier on. Anyways, I think we have some question. Gattaca becoming real? Gattaca is becoming real. It's true. COLTON OGDEN: I actually don't know what that is. ANDY CHEN: It's a Battlestar Galactica. It's a SyFy series. I think there's like very similar to human robots that are-- or something. 
The gattacas I think, are the-- I mean, I might be getting this wrong. I've never seen it. Enhance is a dot text. COLTON OGDEN: Do we need to hook you up with a power supply probably? ANDY CHEN: 58%? I think I'm OK for now. But I do have one in my pack in case I need to get it at some point. COLTON OGDEN: OK. If you want to continue, we're probably running low on battery. You can keep going and then just give me your power supply, and I'll plug it in for you right now. ANDY CHEN: Sure. It's in my blue bag in the main folded under the jacket. Cool. So now that we have opened [INAUDIBLE] and we can sort of look through the data here, let's just use R as a calculator. Because that is one of the reasons it's popular, RStudio is popular is because it's easy to use. It doesn't look exactly like you're coding hard core. It's easy for-- ooh-- COLTON OGDEN: What happened? ANDY CHEN: It's a black screen. COLTON OGDEN: Did your computer go to sleep? ANDY CHEN: No, I don't think so. Well, there we go. Thank you so much. COLTON OGDEN: [INAUDIBLE] I believe. ANDY CHEN: Teamwork makes the dream work. Very nice. COLTON OGDEN: I'm not sure why your screen went black. ANDY CHEN: Yeah, I'm not sure. Maybe it just hate me. COLTON OGDEN: That's probably it. I think you figured it out. ANDY CHEN: You can use like, a giant calculator. For instance, Colton, off the top of your head, what is the product of 933 times 186? COLTON OGDEN: If I got that right, that would be amazing. ANDY CHEN: Yeah. Do it, do it, do it. COLTON OGDEN: 900 times 100 would be-- what would that-- That would be 90,000. So I'm guessing like, 126,000 something. ANDY CHEN: Give me some random digits in there. COLTON OGDEN: 1, 2, 6-- 1, 2, 6, 1, 4, 4? ANDY CHEN: 1, 2, 6, 1, 4, 4. I mean, in the same order of magnitude, 170,538. COLTON OGDEN: I'm terrible at that kind of math. ANDY CHEN: So I actually sometimes I use R for my homework for my problem sets because if I don't have a pen or paper, I can like, put it in here. 
I can remember it. So we're doing this in console. Should I do this in a script? What do you think? COLTON OGDEN: You do what-- you multiply 16 by 16. [INTERPOSING VOICES] ANDY CHEN: Can you do it Colton? Hold on. Block the screen. Block the screen. COLTON OGDEN: Is it 100 and-- wait. No, no it's not 196. What is it? Because what's 16 times 6? What's 10 times 6? [INAUDIBLE] So what? 60 plus 96? Would it be 254? ANDY CHEN: 256, dog. You're close. You're close. I did it off the top of my head. COLTON OGDEN: I got-- clearly. Yeah. So in the console, we can actually run R commands here. But we can also actually-- New Script-- we can run it as a script in the folder up here. So as you notice on the screen, we just opened up a new thing right here, a new R script. And so if you want to do 16 times 16-- and so we wrote it. And then we run it, and it'll print 16 times 16 down here in your console. COLTON OGDEN: So it's kind of like Python in that you can execute a script line by line. ANDY CHEN: And I think that's because it's interpreted, not because it gets compiled. COLTON OGDEN: Yeah, that makes sense. ANDY CHEN: So for instance, I do my homework here sometimes. So let's say like, a product is 82 times 93. Later on, I'll be like, oh I want to actually know 40 times the product. So I'll be like, oh I don't know, but I can do 40 times product and then run it. I'll have to run from the top. So one thing that you should know about the script is you have to run it from the top. It is not stored in local memory. It needs to run before it happens. If you are used to using Jupyter Notebook, it's really similar to that. So we run. We stored 82 times 93 into a variable called product. And so product is now in memory, which, as we talked about before, is here, where our local variables are. Or I guess-- in the context of this, [INAUDIBLE]. COLTON OGDEN: Yeah, just your environment. So whatever your current-- ANDY CHEN: Things. I don't know if you would call it a global local variable.
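The calculator-and-variable workflow described above can be sketched in a few lines of R. The numbers are the ones used on stream; the variable name `product` is the one Andy picks.

```r
# R works as a calculator: each expression is evaluated and its value printed.
16 * 16            # 256

# Store a result under a name; after this line runs, `product` shows up
# in the Environment pane.
product <- 82 * 93
product            # 7626

# Reuse the stored value later without retyping the original numbers.
# This only works if the assignment above has already been run.
40 * product       # 305040
```

Running the last line before the assignment would fail with an "object 'product' not found" error, which is the reason for running a script from the top.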
COLTON OGDEN: There's like, a frame. I think it's called a frame. And it's basically just whatever all the-- it's the global environment. So I imagine it's got its own global variables in this context? ANDY CHEN: I think so. Yeah, that would make sense. Let's go with that. So let's say that later on in my homework, I'm like, oh! Now I need to take 40 times product. I don't remember what product was, but hey, it's stored here. So I can do 40 times product. It's 305,040. Who knew? R did. COLTON OGDEN: Yeah. R has got the hookup. ANDY CHEN: I got the hookup. COLTON OGDEN: And thanks to Twitch Hello World saying I'm a good sport, [INAUDIBLE] for saying Colton is super smart, though. He just can't multiply. And I'm actually disappointed, because now I realize that we did ask, though, 16 by 16 in the chat. I feel like it's something a programmer should know, just off the top of their head. 16 times 16 is 256, right? I feel like that's something that I have to have memorized. ANDY CHEN: I definitely would not be able to do that yet. COLTON OGDEN: Corrugated Drop, thank you very much for following. We're doing live mental multiplication on stream. It's great. ANDY CHEN: Yeah. All right. So let's see. So let's-- oh actually-- so the third way to load a data set is as follows. In your script or in your console-- because they are sort of functionally a [INAUDIBLE] in the sense that whatever commands you write in console you can also do in scripts. The difference is scripts, you can run over and over. It's being stored as like, as a program, as a script. So the other way that we can load a data set is I'm just going to make a variable called NHANES. And I'm going to store in it read.delim, and then whatever the location of my data set is, which is as follows. If I do that, then what has actually been done is I've stored NHANES in a variable called NHANES. So let's try that again. Let's say that my environment is empty. If I run this command-- whoops. It does not like that.
That should work. I wonder why it's not working. It's a period here. That should be a-- COLTON OGDEN: And it should be Downloads, right? ANDY CHEN: Ooh, good catch. Maybe? Yeah, there we go! And so it pops up. Thank you, Colton. COLTON OGDEN: I try, you know. That's what friends are for. ANDY CHEN: Teamwork. Teamwork brought to you by teamwork, the official drink of teamwork. COLTON OGDEN: Yeah, exactly. ANDY CHEN: So as you'll notice, the NHANES data set pops up on your right, right here. So that's the third way to import a data set. So now that we have our data set here, let's-- and so earlier, I was mentioning how it might be useful to know how big your data set is. And so one of the really easy ways to do that is there's a command called nrow. If we call nrow, and then pass in NHANES as its argument, when the return's done-- so I've saved it. I keep forgetting. Now we have to actually run it. In our console it returns. nrow is called. NHANES is passed in as the argument. And it returns 10,000. COLTON OGDEN: So it's just basically short for number of rows? ANDY CHEN: Number of rows. Yeah. So it's a very good-- if you forget like, how big your thing is, it's a very useful command for that. COLTON OGDEN: Cool. Great. ANDY CHEN: All right, so Colton, you were talking earlier about making a subset of data that would be interesting. So I suggest that today. And not a suggestion. This is the only way we're going to do it, because it's the only thing I have notes for. Let's make a subset of the data for just pediatric patients. So conceivably, can you think of a situation where I would like to know certain data from the NHANES population, but I only care about knowing it in children? COLTON OGDEN: Check for age? So check for age is less than 18? ANDY CHEN: Yeah, exactly. So the way we do that-- whoops. We're going to go back into our script. We're going to make a new variable called NHANES pediatric.
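As a rough sketch of the third import route and the row count just described, the following reads a tab-delimited file into a variable with `read.delim` and then calls `nrow`. The temporary file stands in for nhanes.txt, and its columns and values are invented for illustration; on stream the path pointed at the Downloads folder.

```r
# Stand-in for nhanes.txt: a small tab-delimited file with made-up rows.
path <- tempfile(fileext = ".txt")
writeLines(c("ID\tGender\tAge\tBMI",
             "1\tmale\t34\t27.1",
             "1\tmale\t34\t27.1",
             "2\tfemale\t9\t16.4",
             "3\tfemale\t51\tNA"), path)

# Same shape as the command typed on stream: read the file and keep the
# result in a variable so it appears in the Environment pane.
NHANES <- read.delim(path)

nrow(NHANES)   # number of rows (individuals): 4
ncol(NHANES)   # number of columns (variables): 4
```

A wrong path produces a "cannot open file" error, which is the kind of failure seen on stream before the Downloads folder was corrected.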
And then say we're going to call subset, which is a function that subsets a data set into a new data set. And then it's going to take in the original NHANES as an argument. And then we're going to give it these parameters where the age-- so what this line is doing is we're making a new variable called NHANES pediatric. We're going to call subset and pass in the original NHANES data set as an argument, with the caveat that we only want entries, rows, individuals whose age-- which, let me open NHANES here-- it's actually a column here-- and age is here-- the value of it is equal to or less than 18. And so that makes someone pediatric. COLTON OGDEN: OK, makes sense. ANDY CHEN: So let's run it. And oh look! Right here popped up a new data set in our global environment. And so we have successfully subsetted the NHANES data set to the point where we're only looking at pediatric individuals. COLTON OGDEN: And then Faceless Voice is saying could you remove the duplicate IDs? Do you know offhand how to do that? Is there a set [INAUDIBLE] off of a key? ANDY CHEN: You know, there's a way to do it, but I-- let's think about it. OK, so. We could run that again. So where ID, 5, 1, is not equal to-- I wonder if that'll work. It'll get rid of the original. COLTON OGDEN: It'll get rid of all of just the 5, 1, 6, 2, 4's, but that won't actually make it a set. I'm curious if there's like, an R set function. That would be pretty handy. Sets. Make set with order. There's a lot of documentation. Let me see. R set on key. I feel like that would be-- there's so many functions called like, set key. What's the purpose of setting a key in data.table? ANDY CHEN: Every time I do a deep dive into the docs, I die a little bit. COLTON OGDEN: How to perform a cumulative sum of the unique IDs only. ANDY CHEN: So Twitch Hello World, is this similar to Excel? There are a lot of things that you could do in Excel that you can do in R.
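The pediatric subset described above can be sketched as follows. The data frame is a toy stand-in for NHANES, and the column names `ID`, `Age`, and `BMI` are assumptions about how the real file spells them.

```r
# Toy stand-in for the NHANES table; column names and values are assumptions.
NHANES <- data.frame(
  ID  = c(1, 2, 3, 4, 5),
  Age = c(3, 18, 19, 45, 11),
  BMI = c(15.2, 22.0, 24.3, 30.1, NA)
)

# Keep only the rows whose Age is 18 or less, as described on stream.
NHANES_pediatric <- subset(NHANES, Age <= 18)

NHANES_pediatric$ID      # 1 2 5
nrow(NHANES_pediatric)   # 3
```

Inside `subset`, the bare name `Age` is looked up as a column of the data frame passed in, so there is no need to write `NHANES$Age` in the condition.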
But there are arguably more things you can do in R than you can do in Excel. R is a-- some people call it a statistical language, because it's very easy and useful for statistical analysis. A lot of things you can do in Excel, but R is more powerful in the sense that it's more low level, and you can implement your own functions. Although you can do that with macros in Excel. But I think it's more powerful, it's more low level. COLTON OGDEN: Yeah, they wanted to figure out how to get rid of all the duplicate IDs. Yeah, if you don't know the function offhand, we can maybe forge ahead and then-- it looks like it's somewhat hard to-- oh, maybe the unique function, [INAUDIBLE] is saying? R unique function. Oh, OK. So what unique will do is it will actually just get rid of all duplicate rows, which I think will work for our use case. ANDY CHEN: Yeah, I think it does. Yeah. COLTON OGDEN: So just call, yeah, I guess, unique first. You'd be like, NHANES unique, because you probably want to do it after the-- or I guess you want to do it on that. Yeah, that works. ANDY CHEN: Yeah, either way, I think it would probably be OK. OK admitted [INAUDIBLE]. Let's look at how many rows there are. COLTON OGDEN: Oh, but you probably want to assign it to a variable too, right? ANDY CHEN: Yes. Into itself, you think? COLTON OGDEN: Sure, yeah. ANDY CHEN: So let's run that. Cool. OK. So let's actually-- COLTON OGDEN: So then you can print nrow on NHANES pediatric before you do the function, yeah. ANDY CHEN: And then that will tell us how many individuals. So we should see at least three fewer. COLTON OGDEN: So thanks Bella, for tossing that in the chat. Magus 503 says there's also duplicated. So I guess that will give us how many are the same? ANDY CHEN: Are the same? That's a good one to know. Thank you. COLTON OGDEN: It looks like it's printing out for some reason. ANDY CHEN: Yeah. But that's because I haven't saved it to a variable.
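The two functions suggested in chat, `unique` and `duplicated`, can be sketched on a toy data frame. The IDs and ages below are made up; only the pattern of one individual appearing three times mirrors what was seen on stream.

```r
# Toy pediatric table in which the first individual appears three times.
NHANES_pediatric <- data.frame(
  ID  = c(101, 101, 101, 102, 103),
  Age = c(4, 4, 4, 9, 16)
)

nrow(NHANES_pediatric)               # 5 rows before de-duplication

# duplicated() flags every row that repeats an earlier row.
duplicated(NHANES_pediatric)         # FALSE TRUE TRUE FALSE FALSE
sum(duplicated(NHANES_pediatric))    # 2 repeated rows

# unique() drops the repeats. The result must be assigned, otherwise it is
# only printed to the console and the original variable is unchanged.
NHANES_pediatric <- unique(NHANES_pediatric)
nrow(NHANES_pediatric)               # 3 rows afterwards
```

Note that `unique` compares whole rows. If two rows shared an ID but differed in some other column, both would be kept; filtering with `!duplicated(df$ID)` would be one way to keep a single row per ID in that case.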
COLTON OGDEN: Oh, I see. Gotcha. ANDY CHEN: So if we did not call unique, then what happens is we have 2,628 entries. And if we do end up calling it, we have 2,246 entries. COLTON OGDEN: OK, nice. So there's about 400 duplicates. ANDY CHEN: That's quite a lot. COLTON OGDEN: Duplicates. ANDY CHEN: Duplicitous. What does duplicitous mean? COLTON OGDEN: Good question. I actually don't know. ANDY CHEN: Like serendipitous? COLTON OGDEN: We can-- we can use the good old dictionary app. Duplicitous, deceitful. ANDY CHEN: Oh really? Oh! Oh, that makes sense. COLTON OGDEN: Treacherous, duplicitous. ANDY CHEN: Treacherous! COLTON OGDEN: The Vocabulary stream as well. ANDY CHEN: Yeah, we're learning lots of things today. Great. So now we have made a subset with unique individuals. Very nice. OK. COLTON OGDEN: And we did it live. We figured it out live, even better. Even better. Thanks to the Twitch stream for shouting it out for us. They got our back. They always got our back. ANDY CHEN: We always got your back. Oh actually, so this is interesting. In statistics in general, if you have a data set and you have information that is what you think is erroneous, or it's absent-- like, for instance, let's say you have an individual who didn't give an age or a gender or a BMI et cetera, you have to consider whether there is a reason why that data was omitted. Because there might be an underlying factor there that is actually causing that data to be omitted. That's just something that you have to be considerate of when you're performing biostatistical analysis. So for instance, let's say that-- so the sum function is interesting. It's going to count how many times the argument occurs. And I'm going to say is.na, which is saying the thing is not a thing. It's confusing, but I'll explain it in a minute. Let's look at NHANES pediatric. And then we use a dollar sign, which is to reach into the data set and say hey, which variable are we looking at? I think most of these are going to be not empty.
So blood pressure systolic av. That's some kind of blood pressure thing. So my guess is that there's going to be some empties here, but let's run it. Yep. So sum counts how many times the thing that's inside happens. And what we put inside is is.na. And what this is saying is like, the thing itself, if it is absent, NA, like not applicable. And then what is not applicable? In the data set NHANES pediatric, specifically for BP sys av. So what that means, if we're going to visualize it, is it goes into this NHANES data set. It goes into the variable called-- what do we do? Is this [? b sys av ?] or something? COLTON OGDEN: Yeah, it's one of those ones that's kind of abbreviated. [INTERPOSING VOICES] Oh, no, you had BP sys av. ANDY CHEN: BP sys av. Oh, thank you. This one. So some of these have entries in them. But you'll notice that, for instance, this one has an NA in it. It doesn't have anything. So what I was alluding to earlier is in statistics in general, you should consider why your data set might have an empty there. Just because it's empty doesn't mean that you should omit it, because there might be a very interesting reason and underlying factor for why it's being omitted. We're not going to consider that too much today. But if you go further on into doing biostatistics or statistics in general, you definitely need to put thought into why you're omitting data if you decide to omit. COLTON OGDEN: Makes sense. Osire18 says I'm a biochemist out sick from work. Thank you cssatv for some [INAUDIBLE]. ANDY CHEN: Welcome back to the-- so the world of bio stuff. You can't escape it. I hope you feel better. So this line in line 11 is counting up each of these rows. So one, two, each of these individuals, this person, this person, this person, this person. And it's saying, in its data set for the variable BP sys av, does it say NA in it? If so, add to sum. And then so when we've called it, it returns 1,022.
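The counting step walked through above can be sketched on a toy table. The column name `BPSysAve` is an assumption about how "BP sys av" is spelled in the file, and the values are invented.

```r
# Toy pediatric table; BPSysAve is an assumed spelling of the column name.
NHANES_pediatric <- data.frame(
  ID       = c(101, 102, 103, 104),
  BPSysAve = c(NA, 98, NA, 110)
)

# $ picks one column out of the data frame; is.na() then returns TRUE for
# every missing entry and FALSE otherwise.
is.na(NHANES_pediatric$BPSysAve)        # TRUE FALSE TRUE FALSE

# sum() treats TRUE as 1 and FALSE as 0, so this counts the missing entries.
sum(is.na(NHANES_pediatric$BPSysAve))   # 2
```

In R, `NA` stands for "not available", which is close to the "not applicable" reading given on stream.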
So in our particular data set, our subset of data set for pediatric patients or individuals, 1,022 of them have an empty in them. Interesting. I don't know why, but the reason why might actually be important. So something to think about. COLTON OGDEN: And it look like they used the dollars sign kind of Excel syntax there too. ANDY CHEN: Oh, is that Excel syntax? COLTON OGDEN: They have like dollar sign for like, the row. ANDY CHEN: Oh, that's right. Yeah, that's right. COLTON OGDEN: And so that kind of reminds me of. It's nice that you can index into this the sheet like that. ANDY CHEN: That's the word to use. Index, yeah. COLTON OGDEN: Fancy CS words here, you know. Indexing. ANDY CHEN: My life is zero indexed. COLTON OGDEN: There you go. ANDY CHEN: How many people are in this room? COLTON OGDEN: Uh, yeah, one. ANDY CHEN: Very nice. OK. So let's actually do some interesting things. So I suggest that we not look at BPS sys av, because as enthralling as BPS sys av might be, I think it might be easier to look at some more tangible data, some more tangible variables. So let's look at age, because it's easy to understand what age is. Let's also look at gender, or as it's saved in gender. But what we would probably call sex in common parlance-- and BMI. So let's run all of those. OK. So none of the patients have age missing. Great. Zero patients here. And then gender, none of the-- I should say individuals, not patients necessarily. None of the individuals are missing gender. So each of the individuals in the study have an assigned or provided or were assigned a gender. COLTON OGDEN: Because it's like, a mandatory field probably for the survey. ANDY CHEN: Exactly. Right. Yeah, that makes a lot of sense. Because I could I could see how blood pressure might not be the easiest thing to get from a baby. And so that is probably would be easy to leave blank, which is what we were talking about before. Is there a reason that this is blank? Maybe it's because it's a baby. 
Maybe most babies are difficult to get blood pressures from, in which case that's an interesting fact of itself. But gender and age are pretty straightforward to get. And the last thing we're going to look at is BMI, which, is everyone familiar? Should I talk about BMI do you think? COLTON OGDEN: Does it pertain to the-- ANDY CHEN: What we're going to do. COLTON OGDEN: If it does, then sure. ANDY CHEN: It's body mass index, I think. It's just it's a measure of someone's health in terms of if they're obese or skinny or overweight. It's not super accurate, but it's been historically used. And so that's what we're going to be looking at. It ranges from a scale I think, 0 to like, 60 or something. But just think about it as a continuous range. That's the most important thing to think about. COLTON OGDEN: It's the percentage of adipose tissue to non-adipose tissue in the body. ANDY CHEN: Oh, is that what it is? COLTON OGDEN: Is it? Body mass index. ANDY CHEN: So I think it's actually more accurate than what it is. I think it's just a proportion. You take the ratio of your height to your weight, and then assigns you some index value. COLTON OGDEN: Oh, you're right. It's the ratio mass versus height. I thought it was a ratio of adipose to non-adipose. ANDY CHEN: That's really specific. COLTON OGDEN: That would be very hard to calculate super easily. So yeah. I guess weight to height makes more sense. So that's easier to calculate. And [INAUDIBLE] saying 38 plus 2 is that we're referring to the people in the chat room. No, it was a joke Andy was making about in this physical room, there is one-- because get it? Because zero and then one is two in zero indexed. ANDY CHEN: In computer science, most languages are zero indexed, which means they start counting at zero. So you wouldn't say one two people. There's zero one persons. Let's see, what else? It's giving me error in unique NHAMES. So, it's not NHAMES-- NHANES with an N. Maybe that's your error. But not sure. 
COLTON OGDEN: Yeah, that would make sense. ANDY CHEN: I think that's probably what that is. Should we assign the data to NHANES [INAUDIBLE] line first? I don't know which line they're talking about. COLTON OGDEN: I think [INAUDIBLE] is responding to [? Babbick. ?] ANDY CHEN: Oh. Ah! COLTON OGDEN: And [INAUDIBLE] is also responding to them. I think [? Babbick ?] may have missed the line where you load the data set. Do you want to bring that back up? Just [INAUDIBLE] that line? ANDY CHEN: This one? COLTON OGDEN: The very first line, where the very first data set gets loaded using the read.delim function. ANDY CHEN: Right here, yeah. COLTON OGDEN: And then we're assigning that into a variable called NHANES. ANDY CHEN: Correct. That's right. COLTON OGDEN: And then Steve is asking what R packages are you using? ANDY CHEN: So we've actually loaded none so far. But we will be using ggplot2 in a little bit to visualize some data. That's probably the most common data visualization package in R. COLTON OGDEN: Are you able to check if there are other columns that have the same sum as BP sys av? ANDY CHEN: I can conceivably think of writing a script that can do that. If you are interested in that, it might not be very efficient, but the thing that comes off the top of my head is to have a for loop to run that for each of the variables there are, and then to store the results in some kind of data structure, and then see if they are similar. If you do it in the equivalent of a dictionary, then you would just take the ones that are the same, and then you would take the key and do whatever the key is in it. What's the delim thing? The delim thing is actually just the syntax for loading a data set. So if I were to actually import the data set and do this whole step from the beginning, it would do the same thing here, but I've just manually done it through a command. Great. So the interesting thing here is some of our pediatric individuals, 275 of them, actually don't have a BMI.
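The loop Andy describes for finding columns with the same number of missing values was not written on stream, so the following is only one possible sketch of it. A named vector plays the role of the dictionary, and the data frame, its column names, and its values are all hypothetical.

```r
# Toy table with assumed column names.
df <- data.frame(
  Age      = c(1, 5, 9, 14),
  Gender   = c("male", "female", "female", "male"),
  BMI      = c(NA, 16.0, NA, 21.5),
  BPSysAve = c(NA, NA, 101, 112)
)

# For loop over the columns, storing each count under the column's name.
na_counts <- integer(0)
for (col in names(df)) {
  na_counts[col] <- sum(is.na(df[[col]]))
}
na_counts

# Names of the columns whose count matches that of BPSysAve.
same_as_bp <- names(na_counts)[na_counts == na_counts["BPSysAve"]]
same_as_bp   # "BMI" "BPSysAve"

# The loop can also be replaced by a single vectorised call.
colSums(is.na(df))
```

Matching counts only say that two columns have the same number of gaps, not that the gaps fall on the same individuals; comparing `is.na(df$BMI)` with `is.na(df$BPSysAve)` row by row would be needed for that.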
So that's interesting. But again, we're just sort of going to ignore that for the purposes of this demonstration. But in real life, you should think about is there a reason for why these BMIs are empty, because that could actually be an interesting underlying reason. COLTON OGDEN: Makes sense. ANDY CHEN: Before we go too much further into R, we should have a conversation about the kinds of variables in statistics. So what is different between a variable called gender and BMI in terms of what the possible answers could be? COLTON OGDEN: Gender, in this model where we're going based on biological sex, and it's basically zero or one, it's almost like a Boolean, but in this case, it's just a very limited set of options. ANDY CHEN: That's exactly right. COLTON OGDEN: And then whereas the BMI would be a floating point value that could range between 0 and some-- basically, they handle different ranges of potential values. ANDY CHEN: Of potential value. That's exactly right. COLTON OGDEN: In this case actually, one is a different type of data. One's a floating point value versus the other one is an enum. ANDY CHEN: Very nice. The answer I was looking for, which is more of the statistical approach to it, is male versus female are the possible answers for gender, as it's used in this data set. Male and female are the only two possible answers, which means that it, like a Boolean, which can be true or false, it's either this or that. There's a name for this kind of variable in statistics. And it's called a dichotomous variable or a binary variable. COLTON OGDEN: Makes sense. ANDY CHEN: And so that's related to something else called a categorical variable, which is sort of like, what color is this chair? COLTON OGDEN: So that would be an enum, categorical variable. In programming. Like red, blue, yellow, green from a limited set of options. ANDY CHEN: Enum-- So those are categorical variables. These are things that don't have numbers in them per se. 
But BMI, because it is a range of any possible value between 0 and whatever, infinity-- actually, I don't think BMIs go up that high, but it's a range of BMIs-- [INTERPOSING VOICES] Yeah, go get on the treadmill or something. I'm approaching that after Thanksgiving. And so the contrast there is, right, these are two very distinct kinds of variables. One of them is number based, and one of them is sort of categorical. And so the way that you perform statistical analysis on these variables depends on what kind of variables they are. And so today, if we have time, we'll talk about a few different kinds of situations where the dependent and independent variables are categorical or numerical. And then we'll talk about the different techniques that we can use to analyze that. COLTON OGDEN: What's the delim thing? ANDY CHEN: Theoretically and practically, does R handle an unlimited amount of info in a data set? I think it's a lot less processor intensive than Excel is. So it's probably exponentially better at handling data than Excel. But unlimited, no, because you'll overflow. --two versus a lot. OK. BMI of space objects. I like it. Very nice. So the first thing we're going to look at in our particular instance is that gender is a binary or dichotomous variable. And we want to compare that against a continuous numerical variable like BMI. And so the right tool to use in this particular instance-- the reason is that we have two separate populations, male and female, and then some numerical quantity. The first test that comes to my mind in statistics is probably a T test, which is a comparison of means. And so there are certain requirements, sort of assumptions, that using a T test requires. We're not going to talk about them today, but if you're interested in statistics, it's important to think about whether to use a parametric test or a non-parametric test, parametric being usually stronger, but having stronger requirements.
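The stream has not shown the T test call at this point, so the following is only a sketch of how a comparison of mean BMI between two gender groups might look in base R with `t.test`. The data are simulated, and the column names `Gender` and `BMI` are assumptions.

```r
set.seed(1)   # make the simulated numbers reproducible

# Simulated stand-in: 20 "female" and 20 "male" rows with random BMI values.
toy <- data.frame(
  Gender = rep(c("female", "male"), each = 20),
  BMI    = c(rnorm(20, mean = 19, sd = 3), rnorm(20, mean = 20, sd = 3))
)

# Formula interface: numeric outcome on the left, two-level grouping
# variable on the right. By default this runs Welch's T test, which does
# not assume the two groups have equal variances.
result <- t.test(BMI ~ Gender, data = toy)

result$estimate   # the two group means being compared
result$p.value    # p-value for the difference in means
```

For the non-parametric alternative mentioned above, base R also offers `wilcox.test`, which accepts the same formula style.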
At the basis of a lot of statistics is the assumption that data lies on some kind of naturally occurring distribution. A lot of times it's a normal or a Gaussian distribution, like a bell curve. Statistics is really looking at what is the likelihood of the data I'm seeing being true, like, occurring, given that the background distribution in reality is probably something like this. And so-- I lost track of where we are. Oh, t-tests! So we're going to use a t-test to compare gender and BMI. And so the next step here is let's check to see if there are any non-answers in our pediatric data set for BMI. So we actually did check that. So if you want to check in the opposite direction-- remember that is.na is saying it doesn't exist, but exclamation point is.na is saying it does exist. So let's put these two together, run that and run that. So we have 275 individuals that don't have BMI and 1,971 individuals that do have BMI. So if we sum those together-- 275 plus 1,971. That returns 2,246. And as we saw earlier above, that is exactly how many individuals we have. So we're just double checking to make sure that everything is good. COLTON OGDEN: Makes sense. ANDY CHEN: And so the next thing we want to do is-- let's make a table. So R is more useful than Excel arguably, because it's very easy to make visualizations, although they're a little bit more confusing to make. So like in Excel, let's say you want to make a table comparing what is the average for males and what's the average for females? You can do it, but you have to go insert plot, and you have to choose the data et cetera. And in R it can be very powerful, because it's just one line to do that. And so the way you would do that is table. And then you would pass in as an argument the data that you're looking at, which is, yeah-- and then we're going to look at it on the basis of gender. Exclude equals false here is a parameter that is saying if there are any empties, then show that there are empties that exist.
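The missing-value check and the one-line table described above can be sketched as follows. This is an illustration only: peds is a small made-up data frame standing in for the pediatric subset, and the column names Gender and BMI are assumed. The sketch uses table's documented useNA argument to show empties; the exclude argument mentioned in the stream is a related option.

```r
# Hypothetical stand-in for the pediatric subset (values are invented).
peds <- data.frame(
  Gender = factor(c("female", "male", "male", "female", "male", "female")),
  BMI = c(17.2, NA, 21.5, 19.8, NA, 23.1)
)

# is.na() flags missing entries; !is.na() flags present ones.
n_missing <- sum(is.na(peds$BMI))
n_present <- sum(!is.na(peds$BMI))

# Sanity check: the two counts should add up to the number of rows.
stopifnot(n_missing + n_present == nrow(peds))

# One-line frequency table; useNA = "ifany" adds an NA column only if needed.
gender_counts <- table(peds$Gender, useNA = "ifany")
print(gender_counts)
```

Because the result is assigned to gender_counts, it can be printed again later without recomputing it.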
And so it turns out there are 1,087 females and 1,159 males in your data set. There's nothing here to the right because there are no empties. But if we choose BMI-- for which there are empties-- and we have this exclude equals false argument, what it's going to show-- ooh. That's not what you want. Oh. The reason this is happening is because BMI is not a binary variable. Table, you should probably only use with binaries. Anyways, age? No, age definitely is full too. COLTON OGDEN: Well, you're doing it off of pediatric data though. So you do have at least 1 to 18. So it's visible on one line. ANDY CHEN: Yeah, that's true. So this is actually interesting too. So the way to interpret this table here is how many entries there are for an individual at least 1 year old, 2 years old, 3, 4, 5, 6, 7, up to 18. Because again, it's pediatric data. There are 121 individuals who are less than one year of age and 93 who are exactly 18 years old. And so table is a very strong command in that it lets you do that in just one line, whereas to do that in Excel would be a few more steps. COLTON OGDEN: I like how easy it is to sort of lay the data out that way. ANDY CHEN: Absolutely. And so I think some people were asking earlier is this Excel or like, what are the advantages of this over Excel? It's that it can be very powerful. It can be very fast. Just one line like this, and this would take you probably at least 10 times as long I think, in Excel. Maybe if you were really good, like, a little faster. COLTON OGDEN: Yeah I'm not that great. So it would probably take me long. ANDY CHEN: Yeah, me neither. COLTON OGDEN: Steve [INAUDIBLE], is there a way to handle such data in C? Theoretically unlimited data sets? The way that I would imagine would be just stream the data, and then replace the same memory with that data. Because you really have a finite amount of memory where you can store information.
So if your goal is to parse information-- the same sort of information-- you're going to want to basically have a chunk of memory that you write to, do some operation on that, and then replenish it with new information. And it's similar to what you see in games with object pools and stuff where you use the same objects over and over again. Because if you spawn an infinite number of objects, you'd run out of memory. So that's kind of the same idea. You would keep a limited chunk of memory that you populate with the data that you want. And then you just overwrite that data as you get new input from your stream, basically. We're going to do a stream next week on C, lower-level C with Nick. So maybe we can cover something similar to that. And Bounty Hunter Ridley, thank you very much for following. ANDY CHEN: So this is the syntax for doing the subset from the regular NHANES into the NHANES pediatric. All right. So the next step is let's actually do some interesting stuff. So we're going to load our first package. And so we call the library function. And ggplot2 is the package we want to run-- actually, we have to install that first. So install.packages, ggplot2. [INTERPOSING VOICES] COLTON OGDEN: -with the actual script too? Like, you're not doing anything necessarily in advance. You can just call and install the packages? ANDY CHEN: Exactly. COLTON OGDEN: There's probably a way to do that in Python as well. ANDY CHEN: Yeah, I think you can. COLTON OGDEN: --had an occasion to do it. And Osire, thank you for following. You posted in the chat earlier. ANDY CHEN: So we're just installing the ggplot2 package, which is-- COLTON OGDEN: It's a lot of stuff. ANDY CHEN: It's a lot of stuff. COLTON OGDEN: It's always fun. It's always fun going through like, if you've ever installed a node package or whatever, NPM, and go through like, all the same packages. Because they all have like, a million subpackages. But to go through and see what all the different individual subpackages are.
Sometimes you have to do that if you have to debug an old NPM project that has a deprecated function or something, you'll have to do that sometimes. And it can be kind of a pain. ANDY CHEN: That sounds like a lot. COLTON OGDEN: It's fascinating going through and sort of digging your way-- sort of following the trail of crumbs back like, as low as you can get in node, which admittedly, is high level. ANDY CHEN: No, it's cool. It's cool, to quote David, looking under the hood to see what's happening there. It's still installing n. It's a lot. COLTON OGDEN: It would be so cool if we deduced the cure for a disease in the streams [INAUDIBLE]. ANDY CHEN: Yeah, Colton, you got that? [INAUDIBLE] COLTON OGDEN: So no. I don't even know how we would begin to do that. We might have to call in the chat for that one. ANDY CHEN: I think that's on you guys. COLTON OGDEN: Call in the ringers. ANDY CHEN: And girls. COLTON OGDEN: Hello Colton. Welcome to the new [INAUDIBLE]. Actually our souls are synchronized working on data set, but on C# project. What are you doing, guys? Just [INAUDIBLE] just genius. Thanks, Goson, for popping in. Welcome to the new-- welcome to Andy, the new friend ANDY CHEN: Hang low, hang loose. COLTON OGDEN: Yeah, that's cool. That's cool that you're doing data stuff in C# as well. Data is just popular in so many environments right now. R and Python are the environments that you typically associate with stats, I feel like. At least the ones that I hear about. But I'm sure that people do it everyday in everything. ANDY CHEN: Yeah, like assembly. COLTON OGDEN: That would be terrible. ANDY CHEN: That would be disgusting. [INTERPOSING VOICES] COLTON OGDEN: I would not want to mess with that. It would be fast, but I would not want to mess with that. ANDY CHEN: It will be very fast. I mean, C's probably fast. COLTON OGDEN: Yeah, C is fast. Could you talk more about variables while we wait? ANDY CHEN: Yeah, absolutely. 
So there are many, many, many subcategories and categories of variables. But in my mind, I distinguish between categorical, which are things that don't have numbers, and continuous variables, which are things that do have numbers. Although you can sometimes have categorical with numbers, in which case-- a common example of that is what's called a dummy variable, which is like, sometimes statistics languages aren't super smart in that you can't just tell it hey, compare this and this. It needs to know, hey, you need to compare one and two. And so people will sometimes plug in like male is one and female is two. And so you make "dummy variables" like that. But really, the distinction between variables is, is this a thing that like, exists in like, a natural number line, or is it really categories? And are you comparing categories to categories? Are you comparing numbers to numbers? Are you comparing numbers to categories? And depending on what you're actually doing, that dictates what kind of statistical test you want to use. And so what I was going to say, right now what we're-- I'm a little concerned. This doesn't usually take this long. COLTON OGDEN: You're downloading the entire-- ANDY CHEN: Oh God. COLTON OGDEN: --R database. ANDY CHEN: Uh! Anyways, well we'll see where that goes. In this particular instance we want to compare gender, which is a category, it's a binary category actually. So it's a it's a dichotomous variable of two populations, males and females, and a quantitative, a numerical variable of each. So BMI is a number. the average might be like, 15. The average for males might be 15. The average for females might be 16. And then what you're doing in a T test is a comparison of two means. And what it's doing is it's sort of like it's consulting the normal distribution and seeing how likely is the difference between these two numbers? 
How likely is that difference to actually exist along that normal distribution? And so in statistics commonly, we will use things called alphas, which are sort of correlated with statistical power. But these are numbers arbitrarily chosen that we use as cutoffs for like hey, this is likely to happen. So we'll say it's statistically significant. And this is something that has to be done first. You have to state your alpha before, yes, you do your test. Let's say my alpha is 0.05-- a very common alpha, 0.05. What that actually means is in a normal distribution, the area of the curve-- so it looks like this-- the area of the curve where such a value actually exists is 5% of the total area under the curve. In other words, that's approximately two standard deviations from the mean. And so we're using a t-test, which is a specific instance of a statistical test that has the specific function-- at least the way I'm using it-- of comparing means between two populations. And let's say you have more than two populations, you would use something called an ANOVA, analysis of variance. Is that right? I think that's right. So you use something called an ANOVA, which is analysis of variance, which is another statistical test that compares the averages of multiple populations. So let's say I wasn't comparing males and females. I was comparing elderly, infants, and children. That's three categories. What is the average for elderly? What is the average for children? What is the average for infants? And I'm looking at the statistical significance, seeing if there is a statistical significance between the means for each of these three categories, these three populations. COLTON OGDEN: Makes sense. ANDY CHEN: And so depending on the actual data that you're looking at, what the variables you're looking at are, that dictates which tests you should use. COLTON OGDEN: [INAUDIBLE], thank you very much for following. Elias in the chat, thank you for joining. He says hello.
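The link between an alpha of 0.05 and "approximately two standard deviations" can be checked numerically, and the three-population case can be sketched with R's aov function. This is a hedged illustration: the two-sided reading of alpha (2.5% in each tail) is an assumption about what was meant, and the groups data frame, its age_group labels, and its values are simulated here rather than taken from any data set in the stream.

```r
# Cutoff stated before running any test.
alpha <- 0.05

# Two-sided critical value of the standard normal: 2.5% in each tail.
z_crit <- qnorm(1 - alpha / 2)

# Share of a normal distribution within two standard deviations of the mean.
within_2sd <- pnorm(2) - pnorm(-2)

# Simulated (hypothetical) measurement for three populations.
set.seed(42)
groups <- data.frame(
  age_group = factor(rep(c("infant", "child", "elderly"), each = 30)),
  value = c(rnorm(30, mean = 16), rnorm(30, mean = 18), rnorm(30, mean = 26))
)

# One-way ANOVA: compares the means of more than two populations.
fit <- aov(value ~ age_group, data = groups)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

print(z_crit)
print(within_2sd)
print(p_value < alpha)
```

The critical value comes out just under 2, which is where the "two standard deviations" rule of thumb comes from.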
Andy are just practices 10x future speech. I was thinking FedEx, kappa. ANDY CHEN: Hey, FedEx. I like it. COLTON OGDEN: I like the table. Do you save it? Especially helpful tables, do you have it set aside somewhere? ANDY CHEN: Which table? COLTON OGDEN: I think just in general, just as a principle, do you ever save tables that you-- ANDY CHEN: Oh! That was a question. So I think you actually can do that. So right here in this environment panel-- you can't see it-- but right here where I'm circling, you'll select your data set. So I'm in NHANES now. And I think you can actually save it and export. So we'll save it as nhanes.text. Save. Ooh. COLTON OGDEN: Required.table. ANDY CHEN: Yeah. OK. Well I guess it saves it in R format. But if you open a desktop, yeah, I guess there's an R data set of n hanes. once I don't open that here. Save. Let's open. [INAUDIBLE] In my desktop. NHANES R.data, yes. I don't know where that loaded. [INTERPOSING VOICES] COLTON OGDEN: [INAUDIBLE] talking about like, the table function that you called as well? ANDY CHEN: Oh! You can absolutely save that as a variable. So I apologize. I think I misunderstood. COLTON OGDEN: That's all right. ANDY CHEN: Thank you. So table, if I run it like this, is just going to provide it here. But let's say I want to save that later. Table saved is a variable name that we will run that as. And then now it's saved as tables saved. And if you want to see it, you just run it again. [INTERPOSING VOICES] COLTON OGDEN: Shout out. David's actually in the chat saying hey, thanks for tuning in with Colton, Andy. Looks like everybody beat us to it. But shout outs the David. Thanks for joining us, for popping in. Did we miss other comments up there? Beverly's asking are we going to do confidence intervals? ANDY CHEN: Ah, confidence intervals. So that's actually very closely related to some of the summary statistics we'll talk about a little bit. 
I think for the purposes of this stream, confidence intervals, the way I think about it in the context of a t-test is it's your mean plus or minus-- if it's a 95% confidence interval, it's the sample mean plus or minus two of the sample standard deviations in your sort of curve. And what it is, is it's a measure of how likely the sample mean that you got from your data is actually your population mean. And so the whole point of statistics is it's virtually impossible for most scenarios to know the actual summary statistic, for example, a mean of a population. And so the way that we try to approximate that is by taking samples, which is what NHANES is an example of. I don't know what the average age of all individuals in the United States is. I think that would be 330 million surveys. COLTON OGDEN: People to measure, yeah. That would be tough. I'm not sure-- ANDY CHEN: That would be very tough. COLTON OGDEN: I mean, I think the census may have that information. ANDY CHEN: That's true. The census might. But well, let's say for all of the seven billion humans on the planet. COLTON OGDEN: That would be tough. ANDY CHEN: That would be very difficult. And so what statisticians do is they take a sample, which is like, a representative sample, which is to say that it actually samples sufficiently from all the different demographics that you're trying to actually analyze. And in probability and statistics, we have this idea of distributions, which is to say that whenever we perform samples-- or this is actually true for any events that happen-- we kind of assume that the likelihood of seeing a particular outcome, whether that's an average age of 15 or 16 or 17 or 69, you sort of assume that that number occurs on a distribution. And the one that we normally think of is a normal distribution or a Gaussian distribution.
And it's a bell curve, which is to say that in general, most of your real events, your real samples are going to occur somewhere close to the middle where it actually is. And so that's sort of like the backbone of this kind of statistics. And I think earlier we were talking about probability, or the 95% confidence interval. The idea of a 95% confidence interval is it's trying to say what is the likelihood that the sample you got, the summary statistics you got from your sample, is actually capturing the actual summary statistic of your population. COLTON OGDEN: More data probably helps in that regard. ANDY CHEN: More data does help. Although depending on which statistician you ask-- COLTON OGDEN: And you have to make sure that you're sampling correctly too. ANDY CHEN: Absolutely. It needs to be representative. There are many assumptions in statistics. And so today I wasn't going to talk about all the assumptions, just how to do it in R. But most of the tests I'm talking about are parametric tests, which means they have a lot of requirements. A lot of assumptions that have to be met before you can actually use these tests. So remember, we said we assumed that data follow some kind of distribution. That's not always a safe assumption. And there's something called the central limit theorem, which is this idea that in general, if your sample size is like, 30 or more, it's very likely to follow. You can assume that it follows a normal distribution. COLTON OGDEN: Enrique 8923, thank you very much for joining. Looks like I think it did finish installing. ANDY CHEN: I think it did, yes. COLTON OGDEN: GGplot2? ANDY CHEN: Ggplot2. [INTERPOSING VOICES] COLTON OGDEN: I think it installed more than two. ANDY CHEN: Ggplot3, am I right? Am I right? All right. So let's see if we can actually do some visualization stuff. COLTON OGDEN: The ggplot2, it's meant to make graphs, those sort of things? ANDY CHEN: Correct. COLTON OGDEN: Is gg short for graphing something?
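The 95% confidence interval discussed in this exchange can be computed by hand and compared with what R reports. A minimal sketch, assuming a simulated sample named ages (invented for illustration): note that the usual textbook formula scales the margin by the standard error, the sample standard deviation divided by the square root of the sample size, and uses a t critical value that is close to 2 for samples of this size.

```r
# Simulated (hypothetical) sample of 100 ages.
set.seed(3)
ages <- rnorm(100, mean = 9, sd = 5)

n <- length(ages)
std_error <- sd(ages) / sqrt(n)      # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # roughly 2 for n = 100

# Sample mean plus or minus the margin of error.
ci <- mean(ages) + c(-1, 1) * t_crit * std_error

# The one-sample t.test reports the same interval.
ci_from_r <- as.numeric(t.test(ages)$conf.int)

print(ci)
print(ci_from_r)
```

Because the margin shrinks with the square root of the sample size, collecting more data narrows the interval, which matches the remark that more data helps.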
ANDY CHEN: I think one of the g's is grammar, graphing grammar, I think, plot? And the two is it's the second iteration of the package. COLTON OGDEN: Makes sense. Makes sense. Sample of data should be diverse says [? Babbick Night. ?] Big data does not equal good data, says [INAUDIBLE]. ANDY CHEN: That's true. It needs to be representative, absolutely. For instance, let's say I take a sample that's only in the city of Pittsburgh. I cannot say that whatever my findings are, are applicable to the United States in general. But if it were representative, if it were diverse-- if I took 30 people from Seattle, Los Angeles, Pittsburgh, Denver, Houston, then you could make the argument that it is representative, and whatever your findings are can be extrapolated too. COLTON OGDEN: That would be representative of urban environments, but not necessarily more rural environments. That's like, outskirted or remote cities or towns. So that would be something even more to take into consideration. ANDY CHEN: Absolutely true, yeah. COLTON OGDEN: I have seen lots of errors and inaccurate info in health care files and hardships in those patients getting these corrected, though they should be able to. This is an issue that comes up as awareness among biostaticians. I think it's unfortunate [INAUDIBLE] might come to conclusive results based on such. ANDY CHEN: That's a really good question. I actually don't work in industry. But from what I hear from colleagues who do or who have themselves heard about it, EMR, EHR, which are Electronic Medical Records, Electronic Health Records, one of the issues with them in their implementation in the United States health care system is that they're relatively new. We're in a state of transition from written to electronic health records. And part of the difficulties of that transition are getting health care providers to actually use the system correctly in that manner. 
And in response to whether this is an issue that comes up amongst biostatisticians: I think that a good biostatistician is absolutely aware of those things. And he or she tries his or her best to ameliorate and work with the data as much as they can. Or if they come up with some kind of conclusion, they put in the caveat, given the limited data, or like, this data can only be extrapolated to whatever that limited representation happens to be. COLTON OGDEN: Makes sense. JL97 finally caught the stream. Thank you for joining us today. Very nice. ANDY CHEN: OK. So ggplot loaded. So an example of a plot that ggplot can make is a histogram. This is actually going to take me a while to type out, because ggplot2 is very powerful, but it's not my favorite. It requires a lot of typing. COLTON OGDEN: A verbose library? ANDY CHEN: It's very, very verbose, correct. COLTON OGDEN: So I get that sense from like, matplotlib as well. ANDY CHEN: Matplotlib is like that too. Although matplotlib is more understandable to me. Like, for example, geom, like geometry? But you know, it could be so many things. COLTON OGDEN: Geometric something or other. ANDY CHEN: But it could be so many things. And this, in a sense, is telling it what I want, a histogram. COLTON OGDEN: Yeah, that is interesting. I'm not sure. You would think it'd be something like style. ANDY CHEN: You would think, yeah. But I don't know. Maybe I'm just not used to it. But-- COLTON OGDEN: [? Babbick ?] said, "This reminded me of 6.002x." I'm not sure what that is. Do you know what that is? ANDY CHEN: 6.002x? Uh-uh. COLTON OGDEN: I'm not sure what that is, [? Babbick. ?] ANDY CHEN: Oh, 6.002. Is that an MIT class? COLTON OGDEN: Oh, 6.002x? Yeah, that might be. ANDY CHEN: The online version of it. COLTON OGDEN: Yeah. ANDY CHEN: Course 6 is electrical engineering and computer science. I don't know 6.002 though. COLTON OGDEN: That would make sense. We can use the Google machine.
6002x course, maybe? Circuits and electronics? Is what it is? MIT 6 point-- yeah, it's circuits and electronics. ANDY CHEN: Cool. COLTON OGDEN: Interesting. ANDY CHEN: 2 plot. Oh, OK. So once you get started we have to call the library. COLTON OGDEN: Ah, interesting. ANDY CHEN: There we go. COLTON OGDEN: OK. Oh, it's like requiring it. ANDY CHEN: Yeah. COLTON OGDEN: Got it. Makes sense. ANDY CHEN: Let's run it. OK. COLTON OGDEN: Oh, nice. ANDY CHEN: We have a graph. COLTON OGDEN: And you have like a dedicated browser to see-- ANDY CHEN: This little guy right here. COLTON OGDEN: --on the right side there, we're a little bit-- I can hide us briefly. It's going to break everyone's heart but-- we don't have a fancy transition off of that either. ANDY CHEN: Oh, are we gone? That's OK. The world doesn't need to see our pretty faces. COLTON OGDEN: What would we do with out Google says [INAUDIBLE]?? ANDY CHEN: Yahoo? Although, I would hate to see the world where we have to use Yahoo, or Bing, god forbid. COLTON OGDEN: "The results being significant is dependent on your alpha. If you're setting your own alpha wouldn't that make it very easy to manipulate the results into being significant, or not significant?" ANDY CHEN: Great question. So, yes. In a lot of scientific fields, psychology, especially I think, is pretty susceptible to this. From what I've been told. I'm not-- I don't have-- just from my friends who are in that field. Yes. Essentially, the way the statistics works is it assumes that there is an underlying distribution for sampling events. That like if you flip a coin 1,000 times there is some distribution that it's going to show up in. Like there will be 500 heads, 500 tails. That's a binomial distribution. I forget, that might be a Bernoulli distribution. But the assumption is that in nature there are some kinds of distributions that actually exist. If you perform some kind of event so many times it will follow a general pattern. 
The more you perform, the smoother the pattern will be. And so a lot of the kind of statistics that we're doing today is we're assuming they follow a normal, or a Gaussian distribution. And so, yes, you're absolutely right. The alpha that we set is arbitrary. So if you think about it in those terms, let's say, simplistically, one journal only publishes articles that are just t-tests, like a single t-test. If our alpha is .05, that means that 5% of the time, or in 1 in every 20 of these articles, one of your answers is going to be incorrect. It's going to be-- you think it's statistically significant, but it's actually just chance, happening 1 out of every 20 times. So yes, statistics is like that. We're assuming a lot about the natural world, that it follows certain rules, certain distributions of events. And we're applying statistics to them to sort of quantify the likelihood of something being true, or not true. But that's statistics. COLTON OGDEN: [INAUDIBLE] is saying, "I was promoting the stream to a Columbia professor, and he said, R was for statistics, and Python for DNA comparison." ANDY CHEN: Yeah, actually, I would agree with that to some degree. I do most of my bioinformatics work in Python. Although, Professor Rafael Irizarry, who is a bioinformatics professor at the T.H. Chan School, the Harvard School of Public Health, he does a lot of his work, I think, in R. So you can conceivably do either in either. COLTON OGDEN: The saying is, different strokes for different folks. Right? ANDY CHEN: That's true. Different strokes for different folks. COLTON OGDEN: I just realized that this chat is a little small. Because I had to shrink it down yesterday. Let's make it a little bit bigger. There we go. ANDY CHEN: Cool. Cool. COLTON OGDEN: That looks nicer. ANDY CHEN: It does look nice. And so do you. COLTON OGDEN: Thank you. I appreciate that. You look amazing. Look at you. ANDY CHEN: Ah. OK. So we've made a plot.
And so, the one thing that's interesting about-- so we're assuming-- this is actually an interesting plot because we were talking about Gaussian plots earlier, very normally distributed. This is not a very normal plot. That might be OK, depending on what we want to do with this data. But I want to talk about some characteristics. It's very peaked. There's a term in statistics for plots that are-- so let's say a normal plot looks like this. If it looks like this, like a very large peak, we call that leptokurtic. And if it's very flat-- if it looks like-- so let's say this is a normal curve, if it looks like, boop, just like that-- we call that platykurtic. And this also is-- it gets like sort of narrow on the right side. Let me see if you can't see my head. Oh, dude. What are these? We call that right skew. COLTON OGDEN: What I should do is just, whoop. ANDY CHEN: Oh, look at that. COLTON OGDEN: And then just give us a little shrink. We did this on the first stream, actually. ANDY CHEN: Shrinking boys. COLTON OGDEN: Use a little shrink down. ANDY CHEN: A little shrinking shrink. COLTON OGDEN: I gave it a little shrink down. ANDY CHEN: I'm into it. COLTON OGDEN: Something like that. ANDY CHEN: I just got really small. Yeah. So this is a right skewed curve, excuse me. And if the tail were on the other side, it would be a left skew. And so these are just terms to describe the shape of a distribution. COLTON OGDEN: OK. Makes sense. ANDY CHEN: Oh, a histogram. We should definitely explain that. I apologize. A histogram is a certain kind of chart, a certain kind of diagram, that shows exactly like-- is that [? Babbick? ?] [? Babbick ?] is saying-- COLTON OGDEN: [INAUDIBLE]. ANDY CHEN: Is it the frequency? Yeah. [? Babbick ?] is saying, it's showing the frequency for a certain thing. So it's actually-- it's sort of uni-dimensional. It's not like a regular plot that has an x and y-axis, where x is an independent variable, and y is a dependent variable.
A histogram only has an x, in the sense that, the only thing that's like data is the x-- it's the BMI. And then the y-axis is showing frequency-- the number of times that this particular value occurs. So for instance, a BMI between 0 and, this looks like it might be 15, occurs maybe 240 times. Whereas a BMI between 16 and looks like 25 occurs probably like 1700 times. So that's what a histogram is. COLTON OGDEN: That would make sense. [? Babbick, ?] is this BMI histogram skewed left? ANDY CHEN: It's skewed right. COLTON OGDEN: Oh, OK. ANDY CHEN: I'm pretty sure it's skewed right. The skew, like, the tail, the part that's slimmer, I think that's the direction you use to describe if it's left or right. COLTON OGDEN: Interesting. Then it's going to be the right skew then. Yeah. Because nobody over 40, there's like very few people over like 40 to 60 BMI. Because at that point, that's very-- ANDY CHEN: Right. Because that's-- yeah, that might not be humanly possible to be there, to have that kind of BMI. OK. So now that we've visualized our data, and this is generally good stuff. There are more visualizations that you should do to test your assumptions. Let's actually try to perform this little test. We're trying to perform a t-test, which is comparing the means, the mean BMI values, between two populations, males and females, in our pediatric subset of NHANES. And so, the way that we do that-- oh, let's overwrite some. So we noticed that there are some data entries that have empty BMIs. Wait. This is not letting me get in here. COLTON OGDEN: [? Magnus ?] is saying, Is there somebody that's around 50? It looks like there is in the data set. It's just very small. Right? ANDY CHEN: It might be. Yeah. There might be a super-- COLTON OGDEN: Because it looks like the red line goes all the way up to 60 something, 65, 63? ANDY CHEN: Yeah that's really, really high. Yeah, I think that's an individual, and someone who's under 18, a pediatric individual who has a very high BMI.
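The histogram and the idea of right skew described above can be sketched as follows. This is an illustration under stated assumptions: bmi is simulated right-skewed data, not the NHANES values, and the bin width of 10 mirrors the setting mentioned on stream. The bin counts are computed with base R so the sketch runs without extra packages; the ggplot2 drawing is attempted only if that package is installed.

```r
# Simulated (hypothetical) right-skewed BMI-like values.
set.seed(1)
bmi <- 12 + rgamma(500, shape = 2, scale = 4)

# Bin edges every 10 units, matching a bin width of 10.
edges <- seq(0, ceiling(max(bmi) / 10) * 10, by = 10)

# Frequencies per bin, i.e. the heights of the histogram bars.
h <- hist(bmi, breaks = edges, plot = FALSE)
print(h$counts)

# In a right-skewed sample the long tail pulls the mean above the median.
right_skewed <- mean(bmi) > median(bmi)

# Optional drawing with ggplot2, only if the package is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(BMI = bmi), ggplot2::aes(x = BMI)) +
    ggplot2::geom_histogram(binwidth = 10)
  print(p)
}
```

Comparing the mean with the median is only a quick heuristic for the direction of skew; the plot itself remains the more reliable check.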
There we go. All right, let's make some space. Great. So we noticed earlier, probably 20 minutes ago I guess, in the stream, some of our BMI individuals were missing. And again, I'm going to reiterate this as very important in statistics: be sure to consider and at least explain why you decided it's reasonable to omit data. And I'm just going to hand wave over that for the express purposes of demonstrating how to perform a t-test in R. You should, you know, explain to yourself, be able to explain to anyone who you're showing your results to, why you've omitted this data. So the way that I'm going to do that is I'm going to overwrite NHANES pediatric. Again, call the subset function with NHANES itself. This time with not is.na. Remember, is.na means that the thing itself is missing, is na. But not is.na is, it is not missing the thing. For gender as well as for BMI. COLTON OGDEN: Andre is asking, "If the data were not grouped into batches of 10 would it not fit a Gaussian curve?" ANDY CHEN: Oh, that's a very good question. So this, I set the binwidth as 10. I can-- so right here in line 24, I can set that to be 1 if I would like. It's an arbitrary binwidth. And so that shows a much clearer resolution. But a better way to think about this is, a Gaussian curve is actually not discrete, a word which means sort of like individual numbers. It's continuous, it's a smooth curve. So a Gaussian distribution does not actually have individual rows. It just has, choo, all of the things in a single curve. And so to demonstrate that, I chose-- I had arbitrarily chosen a binwidth of 10. But we can choose 1. We can choose 0.5. We can choose whatever we want. So in line 26, what we've done is we have resubsetted our data, and this time we've removed any individuals that are missing entries for gender, or BMI. And so now that we've done that, let's perform some actual statistics. We're getting there. COLTON OGDEN: Here we go. Here we go. ANDY CHEN: All right.
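The re-subsetting step described above, keeping only rows where neither gender nor BMI is missing, can be sketched like this. The data frame peds and its values are made up for illustration, and the column names Gender and BMI are assumed.

```r
# Hypothetical data with missing entries in both columns.
peds <- data.frame(
  Gender = factor(c("female", "male", NA, "female", "male", "female")),
  BMI = c(17.2, NA, 21.5, 19.8, 22.4, NA)
)

# Keep a row only if Gender is present AND BMI is present.
peds_complete <- subset(peds, !is.na(Gender) & !is.na(BMI))

print(nrow(peds))           # rows before filtering
print(nrow(peds_complete))  # rows after filtering
```

Saving the result under a new name, rather than overwriting as done on stream, keeps the original rows available in case the reason for the missing values needs a second look.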
Thank you so much for bearing with us. COLTON OGDEN: Asymptotically approaching, as David would say. ANDY CHEN: Absolutely. Oh, yeah. OK. So let's save-- so tapply is a function that lets us take in our data set as an argument, specifically looking at the BMI variable. And we will also compare that to the same data set, but this time with gender. And then the last argument in tapply is going to be what actual thing you want from it. And the first thing we want to do is, I want to take its minimum. And then-- so I'm actually going to do that a bunch but with max, mean-- oops, won't do that. Whoops, there we go. Mean-- mean-- standard deviations-- and then we'll just run all those. Run. Run. Run. Run. So these are all saved. We haven't seen them in our console because we saved them to a variable. But something that is cool that I'm about to do is, we can actually cbind all of these things into a single table. Which again, is a very useful one line command in R that might be difficult to do in Excel. You can probably do it very smartly in Python, though. So we're going to make a new summary table variable. And we're going to call data.frame and save it as a data frame. cbind of min, max, mean, and SD, which are the variables we just made. And then we're also going to do summary table again, just to display it. So line 33 is going to save all those things as summary table. And then line 34 is going to run it and print it. And there you go. We have a very beautiful, very easy to read, table. For females, the minimum value is 12.88 for BMI. For males, the minimum value is 12.89. The mean is 20.49 for females. The mean is 20.05 for males. And so we have our basic summary statistics. COLTON OGDEN: Interesting. ANDY CHEN: Now, looking at this Colton, and knowing the sample size is approximately 1,000 individuals for males and females, do you think there is a significant difference between the average BMI for males versus females in this pediatric data set? COLTON OGDEN: Substantial?
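The per-group summary table built with tapply and cbind, as walked through above, can be sketched as follows. The tiny peds data frame and the variable names bmi_min, bmi_max, bmi_mean, and bmi_sd are invented for illustration; the numbers are not the ones shown on stream.

```r
# Hypothetical data: three females and three males.
peds <- data.frame(
  Gender = factor(c("female", "female", "female", "male", "male", "male")),
  BMI = c(13, 20, 27, 12, 19, 29)
)

# tapply(values, grouping, function) applies the function within each group.
bmi_min <- tapply(peds$BMI, peds$Gender, min)
bmi_max <- tapply(peds$BMI, peds$Gender, max)
bmi_mean <- tapply(peds$BMI, peds$Gender, mean)
bmi_sd <- tapply(peds$BMI, peds$Gender, sd)

# cbind glues the per-group results side by side; data.frame makes it a table.
summary_table <- data.frame(
  cbind(min = bmi_min, max = bmi_max, mean = bmi_mean, sd = bmi_sd)
)

# Typing the variable name on its own line prints the table.
summary_table
```

Each row is one group and each column one statistic, so the means can be read side by side, although the table alone does not say whether the difference between them is significant.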
ANDY CHEN: Significant. COLTON OGDEN: I don't know if I would describe it as being significant. Would you describe it as being significant? ANDY CHEN: Well, that's exactly the answer I wanted. You can't tell. So these are summary statistics, which tell you a mean, or a median, which is a measure of center, saying sort of, what is the middle kind of value. Whereas standard deviation is a measure of dispersion, which tells you how spread apart your data is. But that doesn't actually tell you anything about whether these two populations are similar or dissimilar. And so to do that we would have to use a statistical test. In this case, since we're comparing a binary variable with a continuous numerical variable, we have to use a t-test. And so the syntax for that-- COLTON OGDEN: And also Fatma, in the chat, said, "Can R be added to sandbox.cs50.io?" And from what I see here, I went on to sandbox.cs50.io and we do already have an R option. We probably don't have RStudio available. But the R command line environment we do have. It looks like we do have a sandbox set up for that already. ANDY CHEN: Great. Nice. So if you want to teach yourself how to use the command line, that's a great skill to have too. "Are we going to regression?" I don't think we will hit regression today. We are going to do a t-test. If we have time, we'll do an ANOVA. Actually, if we have time, we'll do a linear regression. But yeah, we'll see how far we get today. So t-test, let's see-- so again, when we look at p values, right? We determine if something is significant, or insignificant. In statistics, before we actually do the analysis, we have to come up with something called H-naught, which is a null hypothesis, and something called H sub a, or an alternative hypothesis. So in statistics, the default is you assume that there is no significant difference. Right? That is the default.
So the null hypothesis here is that there is no significant difference between the mean BMI for females versus males in this sample population. That is the null hypothesis. The alternative hypothesis is that there is a significant difference between the average BMI for males and females in this population. And so those are sort of the underlying statements that you're working with. And the p value for your t-test is what actually tells you whether to reject, or fail to reject, the null hypothesis. And so we'll get to that in one minute. Let me actually perform the t-test. The syntax for that is t.test, BMI. So the first thing that goes into the function call is the continuous variable that we're comparing-- the numbers, the thing that actually has a mean, or an average. And we're going to compare it against the categorical variable, in our case, males versus females. In this case, it's a dichotomous variable, which is an instance of a categorical variable. Let me do gender. We have to tell it that the data we're using is our NHANES data set. And we're going to set var.equal to TRUE, which assumes the variances are equal. So we could talk about that in a little bit, but there are actually a few different kinds of t-tests, where you can assume that the variances are equal, or not equal. And the reason that they're different is the statistical power-- the way it's implemented is actually slightly different. In this case, we're assuming that male and female participants in NHANES are, for the most part, very similar except for their sex. I should probably put this in my script so I don't lose it. So let's run that. And here's what happens. We get this printout-- two sample t-test. So remember, there are different kinds of t-tests. One of them is two sample, in which you're comparing two different populations.
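A minimal sketch of the two sample t-test call described above, using the formula form with the continuous variable on the left and the two-level grouping variable on the right. The data is simulated rather than taken from NHANES, the column and object names are assumptions, and the alpha of 0.05 is the conventional threshold used in the discussion.

```r
# Simulated BMI values for two groups; NOT the real NHANES data.
set.seed(42)
kids <- data.frame(
  Gender = factor(rep(c("female", "male"), each = 200)),
  BMI    = c(rnorm(200, mean = 20.5, sd = 4.5), rnorm(200, mean = 20.0, sd = 4.5))
)

alpha <- 0.05

# Continuous outcome ~ two-level grouping variable.
# var.equal = TRUE requests the pooled-variance (Student) two sample t-test.
result <- t.test(BMI ~ Gender, data = kids, var.equal = TRUE)
print(result)

# The decision is phrased as "reject" or "fail to reject", never "accept".
decision <- if (result$p.value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```

Leaving out `var.equal = TRUE` makes `t.test` fall back to the Welch test, which does not assume equal variances.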
You can also run a one sample t-test, which is something sort of like a before and after. Because otherwise, it's exactly the same except for the one test condition that you've tested. We look at the p value here-- 0.08402. Now if we had chosen an alpha of 0.05, it is greater than our alpha of 0.05, which means we have to accept the null hypothesis, which, Colton, suggests that this difference is significant, or insignificant? COLTON OGDEN: Well, repeat the question one more time. ANDY CHEN: The null hypothesis is that there is no significant difference between the mean BMIs of males versus females. COLTON OGDEN: And if it's greater than the alpha of 0.05 then that is the case? ANDY CHEN: Well, you fail to reject. COLTON OGDEN: Which means then, that we have to say that there is no big major significance. Right? Am I wrong? ANDY CHEN: No. You're correct in your intuition. But I just want to be-- in statistics, people are very careful. You don't really accept it-- you fail to reject the null hypothesis. COLTON OGDEN: Right. So we're not asserting either one or the other, but we are not rejecting the null hypothesis-- ANDY CHEN: Absolutely. COLTON OGDEN: --in this instance. ANDY CHEN: Exactly. COLTON OGDEN: OK. We're not establishing the null hypothesis as being true, or the opposite of the null hypothesis as being true. ANDY CHEN: Exactly. COLTON OGDEN: Right. OK. ANDY CHEN: So let's actually assume that instead of saying 0.08, it said 0.03. Now what would happen? COLTON OGDEN: 0.013? ANDY CHEN: Just less than 0.05. COLTON OGDEN: Then we have to say that we can-- what is the opposite of-- ANDY CHEN: You reject the null hypothesis. COLTON OGDEN: Reject the null hypothesis. ANDY CHEN: So you don't really accept the alternative hypothesis, because you can't really say for certain, or most statisticians would hesitate to say for certain, oh, there is a significant difference. Well, I guess that is what you would say.
But a sort of more careful, more conservative way of saying that is like, oh, we reject the null hypothesis, rather than, we've proven the alternative. COLTON OGDEN: OK. So we're basically avoiding making any assertions either way. ANDY CHEN: Pretty much. Yeah. COLTON OGDEN: OK. At which point can we assert either situation, either case? ANDY CHEN: In scientific writing, when people use statistical tests, if the p value is less than your alpha value, what you'll usually say is, we found significant results. COLTON OGDEN: OK. ANDY CHEN: Although a statistician might be a little more conservative about how they would say that. COLTON OGDEN: OK. So this has to kind of undergo repeated testing, and repeated samples, and repeated sort of rejecting, or failing to reject, a null hypothesis over the course of time before we can affirmatively say one way or the other? ANDY CHEN: Ideally. COLTON OGDEN: And even then, it's probably still not 100%. ANDY CHEN: Ideally. Yeah, that doesn't usually happen, but yeah, you're absolutely on the right track in your intuition of how the statistics works. COLTON OGDEN: OK. That makes sense. ANDY CHEN: But yeah. Let's see if there's any-- COLTON OGDEN: Fatma was saying that they were getting an internal server error. So it looks like the sandbox, at least in America, we can see that it's up, but it's undergoing updates so it's offline, technically. But the landing page is still up. So I think we're updating it probably for the hackathon, or whatnot. It should be up at some point in the near future. "Can you tell the difference between one sample and two sample t-tests again," [INAUDIBLE]. ANDY CHEN: Sure. Let's actually look it up so I am not-- one sample versus two sample t-test. So a two sample t-test has two populations that are different in some sense. So males and females are two separate groups of individuals in this NHANES data set. Whereas a one sample t-test is-- ah!
Sorry, I actually-- so I misexplained one sample t-tests in the beginning. A two sample t-test is comparing whether two populations, two sample populations, have a significant difference in their average for whatever variable you're looking at. A one sample t-test is comparing a known true population mean, some value, to the sample you're looking at, and seeing if there's a statistical difference there. So I'm trying to think off the top of my head if there's a good-- let's say that we know that the population of a city is 500,000. We just know this for a fact. We would perform a one sample t-test if we're trying to see whether it's statistically significant that one of our samples gives us 455,000 instead of 500,000. And in that particular instance you would use a one sample t-test. COLTON OGDEN: Cool. Cool. Makes sense? ANDY CHEN: Yep. Cool. So I think, unless there are a lot of questions on t-tests, we can actually probably go on to ANOVA. COLTON OGDEN: Sure. Let's do it. ANDY CHEN: Sure. So we talked earlier about different kinds of variables. T-tests are useful for when you are comparing the difference of means between two populations. But let's assume, for instance, that you have a categorical variable that has multiple categories-- young, old, middle aged, young adult, et cetera. And you still want to compare an average of some kind of continuous variable between them. So what is the average height of a child, average height of an infant, average height of an adult, average height of an elderly person, and you're trying to see if there's a statistically significant difference between those populations. In that particular instance, you would use a test called an ANOVA, or an analysis of variance. So there are actually a lot of assumptions for each of these tests that we're using, and I just want to reiterate that.
There are-- you should look at your data, see if there are outliers, and see if the data meet the parametric assumptions of each of these tests that you're using. All the tests we're using here today are parametric. If the assumptions aren't met, you can use something called non-parametric tests, which are sort of similar in what they do, but are probably not as statistically powerful. But everything we're doing today is parametric. So we're just going to assume that we've met all these assumptions to use these tests. So ANOVA-- let's look at ANOVA. Let's make a variable called ANOVA BMI race. And so the category of race in our data set is called Race1. It's looking at-- I guess the surveys ask, do the individuals identify as white, or as black, or Mexican, or other, or Hispanic, et cetera. And so, as we can tell, these are not numbers, these are categories, and there are more than two of them. And let's compare that to BMI again, just because that's what we've been using. So in this particular instance, the statistics question we're asking is, is there a statistically significant difference in the average BMI between these ethnicities, or between these races? I think that is the term they use in this data set. And so the way that we would do that is we would call aov, which is the syntax in R. aov of BMI, which is your continuous variable. And then, squiggly tilde, Race1, which again is the name of our variable for race. Oh, and the second argument is to identify your data. We're going to be looking at the NHANES pediatric data that we made. And then we are going to call-- OK. So let's actually do that. We run this function and it saves our ANOVA results into a variable called ANOVA BMI race. Now to actually access it, we call summary of ANOVA BMI race, right here. And so if we run that, it prints out our [INAUDIBLE] results. Excuse me. So if you are interested in ANOVA, some of these numbers are actually very important.
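The aov and summary calls just described might look roughly like this. It is a sketch on simulated data, not the NHANES values; the group labels, the column names `Race1` and `BMI`, and the object names are assumptions chosen to mirror what is said on stream.

```r
# Simulated BMI values for five groups; NOT the real NHANES data.
set.seed(7)
groups <- c("Black", "Hispanic", "Mexican", "White", "Other")
kids <- data.frame(
  Race1 = factor(rep(groups, each = 40)),
  BMI   = c(rnorm(40, 21.5, 4), rnorm(40, 20.5, 4), rnorm(40, 20.8, 4),
            rnorm(40, 20.2, 4), rnorm(40, 19.0, 4))
)

# One-way ANOVA: continuous outcome ~ categorical variable with several levels.
anova_bmi_race <- aov(BMI ~ Race1, data = kids)

# summary() produces the ANOVA table (Df, Sum Sq, Mean Sq, F value, Pr(>F)).
anova_summary <- summary(anova_bmi_race)
print(anova_summary)

# The first row of the table belongs to Race1; pull out its F value and p value.
f_value <- anova_summary[[1]][["F value"]][1]
p_value <- anova_summary[[1]][["Pr(>F)"]][1]
print(c(F = f_value, p = p_value))
```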
The F value, for instance, is a very useful thing. But we're just going to be looking at the statistical significance for right now. Our p value is 0.00781, star, star, which means its alpha is 0.005. That value is less than 0.005. No. That's not true. Well, regardless. If we had set our alpha as 0.05, this value is less than that. Which means we would say, we reject the null hypothesis. And so in this particular instance, we would have set the null hypothesis beforehand, and the null hypothesis is that there is no significant difference in the mean BMIs between categories of race-- white, Mexican, other, whatever the categories were. And so because our p value is less than 0.05, we would reject our null hypothesis, which means that we would probably say something along the lines of, we find that there is a significant difference in the BMI means across these categories. So Colton, can you see why this might be more ambiguous than a t-test, which only has two categories? COLTON OGDEN: Well, I mean, there's just so many races. I don't know if that's part of the issue. ANDY CHEN: Right. Absolutely. If I find-- COLTON OGDEN: And also people can subjectively identify as multiple different races a lot of the time. Because people can have parents of different races. ANDY CHEN: Absolutely true. Yeah. From a statistical standpoint, I think the first note is, now that I know it's significant, where is the significance? Is it between white and black? Is it between Mexican and white? Is it between other and white? It's ambiguous. COLTON OGDEN: And you probably have to take smaller samples, separate samples, and see how they compare against each other. Many different sort of permutations. ANDY CHEN: Actually, yeah, that's absolutely right. So that actually goes a little more low level into what I'm about to show. But so in ANOVA, because your p value is ambiguous, because your-- I was just reading one of the comments. COLTON OGDEN: The last one?
ANDY CHEN: Yeah. COLTON OGDEN: "Would you have just a general statistics review approaching AP testing time? So informative." Maybe. I don't know when AP testing time is. ANDY CHEN: I don't know when that is. But yeah, perhaps we could maybe do that. We're talking about-- oh, right. It's ambiguous. We don't know where the statistically significant difference is. We don't know between which categories, or whether it's between all categories. So we do something called post hoc tests, Latin for after the fact, one of which we'll show. There are different kinds of tests that you use in different circumstances, but we're going to use one called Tukey's post hoc test. So the way we do that is, TukeyHSD of ANOVA BMI race, the variable we just made. And so if we run that, we get printouts of p values of the categories compared to each other. So you were actually saying you run subsets of-- basically what this is doing is very similar to running t-tests between all the possible pairs of categories. COLTON OGDEN: Right. OK. That makes sense. ANDY CHEN: And so the reason it's p adjusted is because when you run that many comparisons, you actually sort of inflate your chance of a false positive. And so Tukey's test is one implementation of this kind of sequential sub-partitioning. And so what this gives us is, we look at the differences-- is the average BMI of Hispanic versus black individuals significantly different? The p value here is 0.60. It's not significant at all. And so most of these are not. But if we look at other with black, then the null hypothesis here, like the special sub null hypothesis here, is that there is no significant difference between the average BMI for individuals who are other compared to individuals who are black. In this case, our p value is 0.03, which is less than our alpha of 0.05. In which case we would say, we reject our null hypothesis that there is not a significant difference.
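As a hedged sketch of the post hoc step, the following runs TukeyHSD on a fitted aov model and lists which pairs fall below an alpha of 0.05. The data is simulated (not NHANES), and the group labels and object names are assumptions for illustration, so which pairs come out significant here says nothing about the real survey.

```r
# Simulated BMI values for five groups; NOT the real NHANES data.
set.seed(7)
groups <- c("Black", "Hispanic", "Mexican", "White", "Other")
kids <- data.frame(
  Race1 = factor(rep(groups, each = 40)),
  BMI   = c(rnorm(40, 21.5, 4), rnorm(40, 20.5, 4), rnorm(40, 20.8, 4),
            rnorm(40, 20.2, 4), rnorm(40, 19.0, 4))
)
anova_bmi_race <- aov(BMI ~ Race1, data = kids)

# Tukey's honest significant difference: every pair of groups is compared,
# and the "p adj" column is already corrected for making many comparisons.
tukey <- TukeyHSD(anova_bmi_race)
print(tukey)

# The result holds one matrix per factor, with one row per pair of groups.
pairs <- tukey$Race1
alpha <- 0.05
significant_pairs <- rownames(pairs)[pairs[, "p adj"] < alpha]
print(significant_pairs)
```

With five groups there are ten pairs, which is why the printout on stream has one line per pair of categories.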
And so you would say, we find a significant difference between the mean BMI for individuals who identify as other compared to those who identify as black. And so it shows you all the possible category differences. So it looks like there are five categories? 1, 2, 3, 4, 5. Yeah, Hispanic, black, Mexican, other, and white, and so it's looking at all the possible pairs, and looking at whether those specific two categories are significantly different. And so that's how you would analyze an ANOVA. COLTON OGDEN: Get a little bit more granular. ANDY CHEN: Getting a little bit more low level, under the hood, as David might say. OK. I think we actually have time for a linear regression. I think that will probably be the last thing we talk about though. COLTON OGDEN: OK. Sounds good. Let's look at it. ANDY CHEN: All right. Oh, this is a lot of code to type. OK. So again, a very, very good statistical habit to get into is visualizing your data before you do your analysis, just to understand what's going on, if you have a lot of outliers, et cetera. And so what we're going to do is we're going to visualize the NHANES pediatric data set that we made, of course. x equals age, y equals height, plus geom_point. Let's run that. All right, so what this did is it's plotting, in the lower right corner here, all of the individuals of various heights in our pediatric data. So ages from 0 to 18, and their heights. And so you'll notice that it has a weird distribution where it's very, you know, blocky. And the reason for that is because we don't think about age as continuous. I'm not 24 and 3/4 years old. Right? I identify as 24. And so that's why each of these columns is the heights for one-year-olds, two-year-olds, three-year-olds, four-year-olds, not 2.4856-year-olds. And that's why it has that distribution. COLTON OGDEN: Right, that's supposed to be a continuous graph. ANDY CHEN: Exactly. That's exactly right.
And so now that we see this, Colton, if I were to give this data to you, do you think that there is a general trend, a correlation between age and height? COLTON OGDEN: It looks like a small one. As you get older, you tend to get taller. ANDY CHEN: Yeah. Right. And so intuitively that makes sense to us. I chose this example because it makes sense that the older you get, the taller you get. COLTON OGDEN: And also the range in height tends to grow as well. ANDY CHEN: That's true. That's actually, really, yeah. That's really interesting. I hadn't noticed that. I mean that makes a lot of sense. Right? Because like we all start about the same size, give or take. COLTON OGDEN: Yeah. People grow at different rates, different sizes. ANDY CHEN: Absolutely. Oh, one thing I forgot to mention is-- COLTON OGDEN: Oh, sorry, [? Babbick, ?] let me go ahead and move it. Move it right over-- I'll make it smaller again, how's that? Let's go up to here. Very tiny chat so we can see the graph just while we're talking about it. ANDY CHEN: So linear regression is a specific instance of when we're comparing a continuous quantitative variable with another quantitative variable. Age is a number from 0 to 18. In theory, you could actually have this like-- well, I have them as discrete numbers 1, 2, 3, 4, 5, 6, but you can have it as like 1.5, 2.8, whatever. And height is also a continuous number, from 0 to 175 centimeters. And so when you're trying to compare two continuous variables, two quantitative variables, you use linear regression. It's actually really similar to an ANOVA, except that the independent variable in an ANOVA is categorical instead of continuous. Right. OK. So the way that we would perform linear regression-- oh, so I think there tends to be-- it looks like there might be a linear relationship here. And there might be a better model that actually describes it. But for our purposes let's perform a linear regression.
Because I think someone was suggesting we do a regression. So ggplot. It's the same exact thing, but we'll add some lines. It's starting to get really unwieldy, which is one of the reasons I don't like ggplot-- look at what I'm about to write here. geom_smooth, method = lm. lm is linear model. se = TRUE. I don't know what that is. fullrange = FALSE. COLTON OGDEN: And also, Fatma, thank you for the kind words. ANDY CHEN: level =-- COLTON OGDEN: And [? Babbick ?] as well. ANDY CHEN: 0.95. There we go. Run. Cool. So we have the graph we had before, and we've actually fitted a model to it. A linear regression. A linear model. So recall why linear regression makes sense here. These are all models; statistics is all about modeling. If you have a categorical variable, how can you make a line with that? Right? There are no axes to make lines on. But if you have two continuous variables, you absolutely can make a line. Which is why a linear regression makes sense if you're comparing a continuous variable with another continuous variable. So I just wanted to plot out what it actually looks like when you're comparing a continuous variable with another continuous variable and it's a linear regression. But to actually perform it in R, you do the following. We're going to make a variable called linear regression. And the syntax is lm, which stands for linear model. We're going to do our-- Oh, did I do weight? Ah, that's fine. Our independent variable-- sorry, dependent variable-- and then the squiggly tilde thing, and then our independent variable on the right. Data is again NHANES pediatric. And, yeah. So we'll run that. And then we'll call summary, which is a function that gives you the summary for certain kinds of data. Call that and look. We get our linear regression summary.
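A minimal sketch of the lm and summary calls described above, with the dependent variable on the left of the tilde and the independent variable on the right. The data is simulated with a built-in slope of 5 and intercept of 80, so the estimates will not match the numbers from the stream; the choice of `Height ~ Age` and all object names are assumptions for illustration.

```r
# Simulated ages and heights; NOT the real NHANES data.
set.seed(3)
age <- sample(0:18, 300, replace = TRUE)
kids <- data.frame(
  Age    = age,
  Height = 80 + 5 * age + rnorm(300, sd = 8)  # true intercept 80, true slope 5
)

# Dependent variable on the left of ~, independent variable on the right.
linear_regression <- lm(Height ~ Age, data = kids)

# summary() reports each coefficient with its standard error and p value.
print(summary(linear_regression))

# In y = m * x + b, the intercept is b and the Age coefficient is the slope m.
coefs     <- coef(linear_regression)
intercept <- coefs[["(Intercept)"]]
slope     <- coefs[["Age"]]

# Predicted height for a hypothetical 10-year-old under this fitted line.
predicted <- predict(linear_regression, newdata = data.frame(Age = 10))
print(c(intercept = intercept, slope = slope, height_at_10 = unname(predicted)))
```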
So there are multiple p values here, and I could go into depth a little bit more perhaps next time, but essentially what's happening is there are actually two things that come out of a linear regression. Remember it's y equals mx + b because it's a line. There are two variables here-- or there are two things that we could potentially plot. m, which is the intercept-- oh, sorry! b, which is the intercept. And m, which is the slope. So linear regressions are very useful, because you want to say, hey, as x goes up by 1, how much does y go up? In this case, as age increases by 1 year, height increases by 4.25 centimeters. And that has a tiny, tiny p value of less than 2 times 10 to the negative 16. So that's way below pretty much any alpha, which tells you that it's statistically almost impossible for there not to be an actual relationship here. COLTON OGDEN: Right. That makes sense. ANDY CHEN: So there's also the intercept, which is sort of like-- remember it's a line, and it doesn't make a lot of sense in the real world, but it's saying, if your age is 0, your height is going to be 1.43 centimeters. Doesn't make a lot of sense in terms of being an actual baby. Right? But that's just how a regression works, you're fitting a line to real data. So cool. I think that kind of finishes off our conversation for today. COLTON OGDEN: Yeah. That was pretty cool. Getting to see the fact that you get all these nice graphical tools as well. Not only just being able to model the data and see the variables that matter to you, but also model them visually. I like seeing things visually. I think that's important. Fatma's saying, "I think all of CS50 is awesome, they train well. Andy being a star among them." ANDY CHEN: Ah, thank you so much. COLTON OGDEN: "Colton's [INAUDIBLE] virtual office hours is unique." It's a fun time. I'm glad everybody's joining in and having fun.
This is my first exposure to R. So thank you very much for coming on to the show and-- ANDY CHEN: Yeah. Absolutely. COLTON OGDEN: --educating us on R and RStudio. This is pretty cool. ANDY CHEN: Yeah. I needed to refresh myself. It's been a while. COLTON OGDEN: Yeah. Yeah. I know, I can imagine. Some of the function calls get pretty bulky there. ANDY CHEN: Yeah, I don't think I could've done this without notes. COLTON OGDEN: Oh, yeah. It'd be tough. I can imagine it being tough. Yeah, this is great. Thank you so much. And then everybody who wants to follow along, or join after the fact, R and RStudio are free. So as we talked about earlier in the stream, and in the chat, definitely grab those, and mess around. Hopefully the sandbox is up and running, with R functioning as well, in the near future. This is back to the firewall screen. If anybody has any last questions before we wrap up the stream, definitely let us know. We'll stick around for just a couple more minutes. This week is the CS50 hackathon. So tomorrow is the hackathon, Andy and I will be there. Going into Friday, because it's an all-nighter. So we start at night and then it goes until, I would say 5:00 in the morning, or 6:00 in the morning. ANDY CHEN: 5:00 in the morning. COLTON OGDEN: We go to IHOP. Yeah. That'll be great. Is it your first or second hackathon? ANDY CHEN: This is my first hackathon. COLTON OGDEN: OK. Nice. ANDY CHEN: When I took this class, I took it in Kenya. COLTON OGDEN: Oh, OK. Yeah, you would have had to vicariously sort of be a part of the hackathon. ANDY CHEN: I saw the-- I liked the videos. COLTON OGDEN: Yeah, no, we have the hackathon. So we won't be streaming this week. We will stream next week. So we have a stream with Nick on Tuesday. And then I'll probably do a stream on Wednesday, that's probably actually the day that I'll do a stream. We'll finish up Space Invaders. And I think Monday, we have another stream lined up.
I need to check my calendar just to 100% verify the stream schedule for next week. But I do believe that that is what we have set up. ANDY CHEN: Nice. COLTON OGDEN: Ba, ba, ba, ba, ba. ANDY CHEN: Ba, ba, ba, ba. COLTON OGDEN: Yeah. So next week we have David and I on Monday for the surprise. So I'm not going to spoil that. And then on Tuesday, Nick will be joining us for a C basics tutorial. And then Wednesday, I will be-- we'll finish up Space Invaders. So yeah, that will be a great time. [INAUDIBLE] saying, "This is really cool. Hope to catch more of these. Stats can be boring, but you make it fun." ANDY CHEN: Ah, thank you so much. If you call it-- here's a trick. If you call it data science, people think it's really awesome. COLTON OGDEN: Yeah. There you go. And they pay more, right? ANDY CHEN: They pay you way more. But it's sort of the same thing. COLTON OGDEN: says, "Thank you Andy, and Colton, very interesting stream." Yeah, thanks for tuning in. Thanks so much. "I was going to opt out today, but [? Babbick Night ?] efforts to make it made me stay tuned." Yeah, so thanks everybody, helping each other stay tuned in. "Love the surprises," says [? Azley. ?] Yeah, me too. This will be a great stream. So yeah, thanks again, Andy. ANDY CHEN: Yeah, of course. COLTON OGDEN: We'll have you again, probably in the spring at some point, or [? J term ?] time-- ANDY CHEN: Sure. Yeah. I'll be around. COLTON OGDEN: --to do a follow-up stream on something, whether it's stats related or otherwise. "Cybersecurity and data science, let's catch followers," says Fatma. "Is there any first steps class you suggest, or is there any intro class that has a professor with good reviews?" ANDY CHEN: At Harvard? Or-- COLTON OGDEN: Maybe just any. ANDY CHEN: So edX, HarvardX has an offering called Stat 110x, [INAUDIBLE] because I worked on it. COLTON OGDEN: There you go. ANDY CHEN: It's pretty intense.
It's an introduction to probability theory, which actually isn't what we're doing today, but it's the background for why what we're doing today works. If that's what you're interested in. COLTON OGDEN: More rigorous? ANDY CHEN: Yeah. It's very difficult. COLTON OGDEN: You'll have to check that out. What was the name of it again? ANDY CHEN: Stat 110x on edX. It's probably offered by Harvard, Stat 110x. COLTON OGDEN: OK. ANDY CHEN: The professor is Joe Blitzstein. COLTON OGDEN: Check it out. ANDY CHEN: Yeah. COLTON OGDEN: You learn some stats. Actually, [? dive ?] into a little bit at some point. ANDY CHEN: Yeah. I forget all of it. COLTON OGDEN: Maybe some R. Some Python data science stuff sounds pretty cool. ANDY CHEN: Name some Rthon. Is that a thing? Rthon. COLTON OGDEN: Rthon? Maybe, I'm not sure. Integrate RStudio with Python. "Thanks for the stream. I want to ask the logic behind the p value, but I know it's the end of the stream so it's cool if you don't have time to explain it." ANDY CHEN: The logic behind the p value is we assume that everything that we sample, all events, happen along a distribution. In this instance, we assume it's a normal distribution. And your p value is saying, what is the likelihood of the actual result we get-- so think about the normal distribution. We think about the area under the curve as the possible region where, conceivably, that event could happen. There is a very small area on both sides, in the tails, that is extremely unlikely, that covers 5% of that area, and that's actually about two standard deviations from the mean. And we're saying that we're OK, essentially, if our results actually were in that area. It's saying like, the likelihood of it being in this area is so far gone, it's 5%, that we're OK with it. But what that also means is, 1 out of every 20 times you do it, you're going to get not a great result. But yeah.
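The tail-area picture just described can be checked numerically with base R's normal distribution functions. This is only a sketch of the idea for a standard normal curve; the observed value `z_obs` is a made-up number for illustration, not a result from the stream.

```r
alpha <- 0.05

# The cutoff that leaves alpha / 2 in each tail of a standard normal curve.
critical_z <- qnorm(1 - alpha / 2)   # close to 1.96, i.e. about two SDs

# Area in both tails beyond that cutoff: exactly alpha.
tail_area <- 2 * pnorm(-critical_z)

# Area beyond exactly two standard deviations is a little under 5%.
two_sd_area <- 2 * pnorm(-2)

# Two-sided p value for a hypothetical observed z statistic (made-up value):
# the probability of a result at least this far from the mean, in either
# direction, if the null hypothesis were true.
z_obs   <- 1.73
p_value <- 2 * pnorm(-abs(z_obs))

print(c(critical_z = critical_z, tail_area = tail_area,
        two_sd_area = two_sd_area, p_value = p_value))
```

Because `p_value` here is larger than `alpha`, this made-up result would lead to failing to reject the null hypothesis.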
COLTON OGDEN: And tune in for the stats course for probably more of the background on that, I'm guessing, right? ANDY CHEN: Yeah. COLTON OGDEN: Someone asked, JumpJump123, "Could you say the course name again?" ANDY CHEN: That would be Stat 110x. COLTON OGDEN: And then, [? TwitchHelloWorld, ?] aka Jacques or Jack, says, "What did you say is the stream Professor Malan is doing Monday?" And that is a surprise. We're not going to spoil that one. Another surprise, you're going to have to tune in. Cool. ANDY CHEN: Nice. COLTON OGDEN: I think that was it. We're going to adjourn now. It's been a little over two hours. Thanks again, Andy-- ANDY CHEN: Yeah. Of course. COLTON OGDEN: --for coming on today. ANDY CHEN: Thanks for having me. COLTON OGDEN: It was great. It was great fun. Great time seeing R. I've always heard about it. Never actually seen it too much in practice. ANDY CHEN: [INAUDIBLE]. COLTON OGDEN: Physically seen it. But it was a great exposure for me. Have a great rest of the week, all. I think we'll probably post videos of hackathon related activities, and pictures, and whatnot on the Facebook group. And I will see all of you next Monday. ANDY CHEN: That's right. Have a great weekend in the meantime. COLTON OGDEN: All right, everybody, on that note, let's have a great rest of the week and weekend. See you next time. ANDY CHEN: Ciao.
INTRO TO R AND BIOSTATS - CS50 on Twitch, EP. 19