So welcome everyone to CS231n. I'm super excited to offer this class again for the third time. It seems that every time we offer this class it's growing exponentially, unlike most things in the world. This is the third time we're teaching this class. The first time we had 150 students. Last year, we had 350 students, so it doubled. This year we've doubled again to about 730 students when I checked this morning. So anyone who was not able to fit into the lecture hall, I apologize. But, the videos will be up on the SCPD website within about two hours. So if you weren't able to come today, then you can still check it out within a couple hours. So this class, CS231n, is really about computer vision. And, what is computer vision? Computer vision is really the study of visual data. Since there's so many people enrolled in this class, I think I probably don't need to convince you that this is an important problem, but I'm still going to try to do that anyway. The amount of visual data in our world has really exploded to a ridiculous degree in the last couple of years. And, this is largely a result of the large number of sensors in the world. Probably most of us in this room are carrying around smartphones, and each smartphone has one, two, or maybe even three cameras on it. So I think on average there's even more cameras in the world than there are people. And, as a result of all of these sensors, there's just a crazy large, massive amount of visual data being produced out there in the world each day. So one statistic that I really like, to kind of put this in perspective, is a 2015 study from Cisco that estimated that by 2017, which is where we are now, roughly 80% of all traffic on the internet would be video. This is not even counting all the images and other types of visual data on the web. But, just from a pure number of bits perspective, the majority of bits flying around the internet are actually visual data.
So it's really critical that we develop algorithms that can utilize and understand this data. However, there's a problem with visual data, and that's that it's really hard to understand. Sometimes we call visual data the dark matter of the internet in analogy with dark matter in physics. So for those of you who have heard of this in physics before, dark matter accounts for some astonishingly large fraction of the mass in the universe, and we know about it due to the existence of gravitational pulls on various celestial bodies and what not, but we can't directly observe it. And, visual data on the internet is much the same where it comprises the majority of bits flying around the internet, but it's very difficult for algorithms to actually go in and understand and see what exactly is comprising all the visual data on the web. Another statistic that I like is that of Youtube. So roughly every second of clock time that happens in the world, there's something like five hours of video being uploaded to Youtube. So if we just sit here and count, one, two, three, now there's 15 more hours of video on Youtube. Google has a lot of employees, but there's no way that they could ever have an employee sit down and watch and understand and annotate every video. So if they want to catalog and serve you relevant videos and maybe monetize by putting ads on those videos, it's really crucial that we develop technologies that can dive in and automatically understand the content of visual data. So this field of computer vision is truly an interdisciplinary field, and it touches on many different areas of science and engineering and technology. So obviously, computer vision's the center of the universe, but sort of as a constellation of fields around computer vision, we touch on areas like physics because we need to understand optics and image formation and how images are actually physically formed. 
We need to understand biology and psychology to understand how animal brains physically see and process visual information. We of course draw a lot on computer science, mathematics, and engineering as we actually strive to build computer systems that implement our computer vision algorithms. So a little bit more about where I'm coming from and about where the teaching staff of this course is coming from. My co-instructor Serena and I are both PhD students in the Stanford Vision Lab, which is headed by Professor Fei-Fei Li, and our lab really focuses on machine learning and the computer science side of things. I work a little bit more on language and vision. I've done some projects in that. And, other folks in our group have worked a little bit on the neuroscience and cognitive science side of things. So as a bit of introduction, you might be curious about how this course relates to other courses at Stanford. So we kind of assume a basic introductory understanding of computer vision. So if you're an undergrad, and you've never seen computer vision before, maybe you should've taken CS131, which was offered earlier this year by Fei-Fei and Juan Carlos Niebles. There was a course taught last quarter by Professor Chris Manning and Richard Socher about the intersection of deep learning and natural language processing. And, I imagine a number of you may have taken that course last quarter. There'll be some overlap between this course and that, but we're really focusing on the computer vision side of things, and really focusing all of our motivation in computer vision. Also concurrently taught this quarter is CS231a, taught by Professor Silvio Savarese. And, CS231a is a more all-encompassing computer vision course. It's focusing on things like 3D reconstruction, on matching and robotic vision, and it's a bit more all-encompassing with regards to vision than our course.
And, this course, CS231n, really focuses on a particular class of algorithms revolving around neural networks and especially convolutional neural networks and their applications to various visual recognition tasks. Of course, there's also a number of seminar courses that are taught, and you'll have to check the syllabus and course schedule for more details on those 'cause they vary a bit each year. So this lecture is normally given by Professor Fei-Fei Li. Unfortunately, she wasn't able to be here today, so instead for the majority of the lecture we're going to tag team a little bit. She actually recorded a bit of audio describing to you the history of computer vision because this class is a computer vision course, and it's very critical and important that you understand the history and the context of all the existing work that led us to these developments of convolutional neural networks as we know them today. I'll let virtual Fei-Fei take over [laughing] and give you a brief introduction to the history of computer vision.
Okay, let's start with today's agenda. So we have two topics to cover: one is a brief history of computer vision and the other one is the overview of our course, CS231n. So we'll start with a very brief history of where vision comes from, when computer vision started, and where we are today. The history of vision can go back many, many years, in fact about 543 million years ago. What was life like during that time? Well, the earth was mostly water. There were a few species of animals floating around in the ocean, and life was very chill. Animals didn't move around much. They didn't have eyes or anything. When food swam by they grabbed it; if the food didn't swim by they just floated around. But something really remarkable happened around 540 million years ago. From fossil studies, zoologists found out that within a very short period of time, ten million years, the number of animal species just exploded.
It went from a few of them to hundreds of thousands, and that was strange. What caused this? There were many theories, but for many years it was a mystery. Evolutionary biologists call this evolution's Big Bang. A few years ago an Australian zoologist called Andrew Parker proposed one of the most convincing theories. From the studies of fossils, he discovered that around 540 million years ago the first animals developed eyes, and the onset of vision started this explosive speciation phase. Animals could suddenly see; once you can see, life becomes much more proactive. Some predators went after prey, and prey had to escape from predators, so the onset of vision started an evolutionary arms race, and animals had to evolve quickly in order to survive as a species. So that was the beginning of vision in animals. After 540 million years, vision has developed into the biggest sensory system of almost all animals, especially intelligent animals. In humans, we have almost 50% of the neurons in our cortex involved in visual processing. It is the biggest sensory system, and it enables us to survive, work, move around, manipulate things, communicate, entertain, and many other things. Vision is really important for animals, and especially intelligent animals. So that was a quick story of biological vision. What about humans, the history of humans making mechanical vision, or cameras? Well, one of the early cameras that we know today is from the 1600s, the Renaissance period: the camera obscura, and this is a camera based on pinhole camera theories. It's very similar to the early eyes that animals developed, with a hole that collects light and then a plane in the back of the camera that collects the information and projects the imagery. So as cameras evolved, today we have cameras everywhere. This is one of the most popular sensors people use, from smartphones to other sensors. In the meantime, biologists started studying the mechanism of vision.
One of the most influential works in both human and animal vision, and one that also inspired computer vision, is the work done by Hubel and Wiesel in the 50s and 60s using electrophysiology. The question they were asking was "what is the visual processing mechanism like in primates, in mammals?", so they chose to study the cat brain, which is more or less similar to the human brain from a visual processing point of view. What they did was stick some electrodes in the back of the cat brain, which is where the primary visual cortex is, and then look at what stimuli make the neurons in the primary visual cortex respond excitedly. What they learned is that there are many types of cells in the primary visual cortex of the cat brain, but one of the most important is the simple cell: simple cells respond to oriented edges when they move in certain directions. Of course there are also more complex cells, but by and large what they discovered is that visual processing starts with simple structures of the visual world, oriented edges, and as information moves along the visual processing pathway the brain builds up the complexity of the visual information until it can recognize the complex visual world. So the history of computer vision also starts around the early 60s. Block World is a set of work published by Larry Roberts, which is widely known as one of the first, probably the first, PhD thesis in computer vision, where the visual world was simplified into simple geometric shapes and the goal was to be able to recognize them and reconstruct what these shapes are. In 1966 there was a now famous MIT summer project called "The Summer Vision Project." The goal of this Summer Vision Project, I read: "is an attempt to use our summer workers effectively in a construction of a significant part of a visual system." So the goal is, in one summer, we're gonna work out the bulk of the visual system. That was an ambitious goal.
Fifty years have passed; the field of computer vision has blossomed from one summer project into a field of thousands of researchers worldwide still working on some of the most fundamental problems of vision. We still have not yet solved vision, but it has grown into one of the most important and fastest growing areas of artificial intelligence. Another person that we should pay tribute to is David Marr. David Marr was an MIT vision scientist, and he wrote an influential book in the late 70s about what he thinks vision is and how we should go about computer vision and developing algorithms that can enable computers to recognize the visual world. The thought process in David Marr's book is that in order to take an image and arrive at a final holistic full 3D representation of the visual world, we have to go through several processes. The first process is what he calls the "primal sketch;" this is where mostly the edges, the bars, the ends, the virtual lines, the curves, the boundaries are represented, and this is very much inspired by what neuroscientists have seen: Hubel and Wiesel told us the early stage of visual processing has a lot to do with simple structures like edges. Then the next step after the edges and the curves is what David Marr calls the "two-and-a-half-D sketch;" this is where we start to piece together the surfaces, the depth information, the layers, or the discontinuities of the visual scene, and then eventually we put everything together and have a 3D model hierarchically organized in terms of surface and volumetric primitives and so on. So that was a very idealized thought process of what vision is, and this way of thinking actually has dominated computer vision for several decades and is also a very intuitive way for students to enter the field of vision and think about how we can deconstruct the visual information.
Another very important seminal group of work happened in the 70s, where people began to ask the question "how can we move beyond the simple block world and start recognizing or representing real world objects?" Think about the 70s: it's a time when there's very little data available; computers are extremely slow, PCs are not even around, but computer scientists are starting to think about how we can recognize and represent objects. So in Palo Alto, both at Stanford as well as SRI, two groups of scientists proposed similar ideas: one is called the "generalized cylinder," one is called the "pictorial structure." The basic idea is that every object is composed of simple geometric primitives; for example, a person can be pieced together from generalized cylindrical shapes, or a person can be pieced together from critical parts and the elastic distances between these parts. So either representation is a way to reduce the complex structure of the object into a collection of simpler shapes and their geometric configuration. These works were influential for quite a few years. And then in the 80s, here is another example of thinking about how to reconstruct or recognize the visual world from simple world structures. This work is by David Lowe, in which he tries to recognize razors by constructing lines and edges, mostly straight lines and their combinations. So there was a lot of effort in trying to think about what the tasks in computer vision are in the 60s, 70s, and 80s, and frankly it was very hard to solve the problem of object recognition. Everything I've shown you so far are very audacious, ambitious attempts, but they remain at the level of toy examples or just a few examples. Not a lot of progress had been made in terms of delivering something that can work in the real world.
So as people thought about what the problems are in solving vision, one important question that came around is: if object recognition is too hard, maybe we should first do object segmentation, that is, the task of taking an image and grouping the pixels into meaningful areas. We might not know that the pixels that group together are called a person, but we can extract out all the pixels that belong to the person from the background; that is called image segmentation. So here's one very early seminal work by Jitendra Malik and his student Jianbo Shi from Berkeley, using a graph theory algorithm for the problem of image segmentation. Here's another problem that made some headway ahead of many other problems in computer vision, which is face detection. Faces are one of the most important objects to humans, probably the most important objects to humans. Around 1999 to 2000, machine learning techniques, especially statistical machine learning techniques, started to gain momentum. These are techniques such as support vector machines, boosting, graphical models, including the first wave of neural networks. One particular work that made a lot of contribution was using the AdaBoost algorithm to do real-time face detection, by Paul Viola and Michael Jones, and there's a lot to admire in this work. It was done in 2001 when computer chips were still very, very slow, but they were able to do face detection in images in near-real-time, and within five years of the publication of this paper, in 2006, Fujifilm rolled out the first digital camera that has a real-time face detector in the camera. So it was a very rapid transfer from basic science research to real world application. So as a field we continue to explore how we can do object recognition better. One of the very influential ways of thinking from the late 90s through the first 10 years of the 2000s is feature-based object recognition, and here is a seminal work by David Lowe called the SIFT feature.
The idea is that to match an entire object, for example here is a stop sign, to another stop sign is very difficult, because there might be all kinds of changes due to camera angles, occlusion, viewpoint, lighting, and just the intrinsic variation of the object itself. But it was inspired by the observation that there are some parts of the object, some features, that tend to remain diagnostic and invariant to changes. So the task of object recognition began with identifying these critical features on the object and then matching the features to a similar object; that's an easier task than pattern matching the entire object. So here is a figure from his paper where it shows that a handful, several dozen SIFT features from one stop sign are identified and matched to the SIFT features of another stop sign. Using the same building block, which is features, diagnostic features in images, we as a field have made another step forward and started recognizing holistic scenes. Here is an example algorithm called Spatial Pyramid Matching; the idea is that there are features in the images that can give us clues about which type of scene it is, whether it's a landscape or a kitchen or a highway and so on, and this particular work takes these features from different parts of the image and in different resolutions and puts them together in a feature descriptor, and then we apply a support vector machine algorithm on top of that. Similarly, very similar work has gained momentum in human recognition. So putting together these features well, we have a number of works that look at how we can compose human bodies in more realistic images and recognize them.
So one work is called the "histogram of oriented gradients," another work is called "deformable part models." So as you can see, as we move from the 60s, 70s, 80s towards the first decade of the 21st century, one thing is changing, and that's the quality of the pictures. With the growth of the Internet and digital cameras, we were having better and better data to study computer vision. So one of the outcomes in the early 2000s is that the field of computer vision has defined a very important building block problem to solve. It's not the only problem to solve, but in terms of recognition this is a very important problem to solve, which is object recognition. I talked about object recognition all along, but in the early 2000s we began to have benchmark data sets that can enable us to measure the progress of object recognition. One of the most influential benchmark data sets is called the PASCAL Visual Object Challenge, and it's a data set composed of 20 object classes, three of them are shown here: train, airplane, person; I think it also has cows, bottles, cats, and so on; and the data set is composed of several thousand to ten thousand images per category, and then different groups in the field develop algorithms to test against the testing set and see how we have made progress. So here is a figure that shows, from year 2007 to year 2012, the performance on detecting the 20 objects in this benchmark data set has steadily increased. So there was a lot of progress made. Around that time a group of us from Princeton to Stanford also began to ask a harder question to ourselves as well as our field, which is: are we ready to recognize every object, or most of the objects, in the world?
It's also motivated by an observation that is rooted in machine learning, which is that most of the machine learning algorithms, it doesn't matter if it's a graphical model, or support vector machine, or AdaBoost, are very likely to overfit in the training process. And part of the problem is that visual data is very complex. Because it's complex, our models tend to have a high dimension of input and have to have a lot of parameters to fit, and when we don't have enough training data, overfitting happens very fast and then we cannot generalize very well. So motivated by this dual reason, one is just wanting to recognize the world of all the objects, the other one is to overcome the machine learning bottleneck of overfitting, we began this project called ImageNet. We wanted to put together the largest possible dataset of all the pictures we can find, the world of objects, and use that for training as well as for benchmarking. So it was a project that took us about three years, lots of hard work; it basically began with downloading billions of images from the internet, organized by the dictionary called WordNet, which has tens of thousands of object classes, and then we had to use some clever crowd engineering trick, a method using the Amazon Mechanical Turk platform, to sort, clean, and label each of the images. The end result is an ImageNet of almost 15 million, or 14 million plus, images organized in twenty-two thousand categories of objects and scenes, and this is the gigantic, probably the biggest dataset produced in the field of AI at that time, and it began to push forward the algorithm development of object recognition into another phase.
Especially important is how to benchmark the progress, so starting in 2009 the ImageNet team rolled out an international challenge called the ImageNet Large-Scale Visual Recognition Challenge, and for this challenge we put together a more stringent test set of 1.4 million objects across 1,000 object classes, and this is to test the image classification recognition results for the computer vision algorithms. So here's the example picture, and if an algorithm can output five labels and the top five labels include the correct object in this picture, then we call this a success. So here is a result summary of the ImageNet Challenge, of the image classification result from 2010 to 2015. So on the x axis you see the years and on the y axis you see the error rate. So the good news is the error rate is steadily decreasing, to the point that by 2015 the error rate is so low that it is on par with what humans can do, and here by a human I mean a single Stanford PhD student who spent weeks doing this task as if he were a computer participating in the ImageNet Challenge. So that's a lot of progress made, even though we have not solved all the problems of object recognition, which you'll learn about in this class. But to go from an error rate that's unacceptable for real-world application all the way to being on par with humans in the ImageNet Challenge, the field took only a few years. And one particular moment you should notice on this graph is the year 2012.
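As a minimal sketch of the top-five success criterion described for the ImageNet Challenge, one could count an image as a success when the correct label appears among the five highest-scoring labels, and report the fraction of images that fail. The function name `top5_error`, the scores, and the labels below are made up for illustration; this is not the challenge's official evaluation code.

```python
def top5_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is NOT among the k highest-scoring labels.

    scores_per_image: list of per-class score lists (hypothetical model outputs).
    true_labels: list of correct class indices, one per image.
    """
    misses = 0
    for scores, truth in zip(scores_per_image, true_labels):
        # Rank class indices by score, highest first, and keep the first k.
        top_k = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        if truth not in top_k:
            misses += 1
    return misses / len(true_labels)


# Hypothetical scores for two images over six classes.
scores = [
    [0.05, 0.40, 0.08, 0.20, 0.15, 0.12],      # true class 3 is ranked second
    [0.90, 0.04, 0.03, 0.02, 0.006, 0.004],    # true class 5 is ranked last
]
labels = [3, 5]

print(top5_error(scores, labels))        # top-five error on this toy data
print(top5_error(scores, labels, k=1))   # stricter top-one error for comparison
```

Setting `k=1` gives the stricter criterion where only the single best guess counts, which is why top-five error is always less than or equal to top-one error on the same predictions.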
In the first two years our error rate hovered around 25 percent, but in 2012 the error rate dropped by almost 10 percentage points to 16 percent. Even though now it's better, that drop was very significant, and the winning algorithm of that year was a convolutional neural network model that beat all other algorithms around that time to win the ImageNet Challenge. And this is the focus of our whole course this quarter: to have a deep dive into what convolutional neural network models are (another name for this, the popular name now, is deep learning) and to look at what these models are, what the principles are, what the good practices are, and what the recent progress of these models is. But here is where the history was made: around 2012, convolutional neural network models, or deep learning models, showed a tremendous capacity and ability in making good progress in the field of computer vision, along with several other sister fields like natural language processing and speech recognition. So without further ado I'm going to hand the rest of the lecture to Justin to talk about the overview of CS231n.
Alright, thanks so much Fei-Fei. I'll take it over from here. So now I want to shift gears a little bit and talk a little bit more about this class, CS231n. So the primary focus of this class is this image classification problem, which we previewed a little bit in the context of the ImageNet Challenge. So in image classification, again, the setup is that your algorithm looks at an image and then picks from among some fixed set of categories to classify that image. And, this might seem like somewhat of a restrictive or artificial setup, but it's actually quite general. And, this problem can be applied in many different settings both in industry and academia and many different places.
So for example, you could apply this to recognizing food or recognizing calories in food or recognizing different artworks, different products out in the world. So this relatively basic tool of image classification is super useful on its own and could be applied all over the place for many different applications. But, in this course, we're also going to talk about several other visual recognition problems that build upon many of the tools that we develop for the purpose of image classification. We'll talk about other problems such as object detection or image captioning. So the setup in object detection is a little bit different. Rather than classifying an entire image as a cat or a dog or a horse or whatnot, instead we want to go in and draw bounding boxes and say that there is a dog here, and a cat here, and a car over in the background, and draw these boxes describing where objects are in the image. We'll also talk about image captioning where given an image the system now needs to produce a natural language sentence describing the image. It sounds like a really hard, complicated, and different problem, but we'll see that many of the tools that we develop in service of image classification will be reused in these other problems as well. So we mentioned this before in the context of the ImageNet Challenge, but one of the things that's really driven the progress of the field in recent years has been this adoption of convolutional neural networks or CNNs or sometimes called convnets. So if we look at the algorithms that have won the ImageNet Challenge for the last several years, in 2011 we see this method from Lin et al. which is still hierarchical. It consists of multiple layers. So first we compute some features, next we compute some local invariances, some pooling, and go through several layers of processing, and then finally feed this resulting descriptor to a linear SVM. What you'll notice here is that this is still hierarchical. We're still detecting edges.
We're still having notions of invariance. And, many of these intuitions will carry over into convnets. But, the breakthrough moment was really in 2012 when Geoff Hinton's group in Toronto, together with Alex Krizhevsky and Ilya Sutskever who were his PhD students at that time, created this seven-layer convolutional neural network now known as AlexNet, then called SuperVision, which just did very, very well in the ImageNet competition in 2012. And, since then every year the winner of ImageNet has been a neural network. And, the trend has been that these networks are getting deeper and deeper each year. So AlexNet was a seven or eight layer neural network depending on how exactly you count things. In 2014 we had these much deeper networks. GoogLeNet from Google and VGG, the VGG network from Oxford, which was about 19 layers at that time. And, then in 2015 it got really crazy and this paper came out from Microsoft Research Asia called Residual Networks which were 152 layers at that time. And, since then it turns out you can get a little bit better if you go up to 200, but you run out of memory on your GPUs. We'll get into all of that later, but the main takeaway here is that convolutional neural networks really had this breakthrough moment in 2012, and since then there's been a lot of effort focused in tuning and tweaking these algorithms to make them perform better and better on this problem of image classification. And, throughout the rest of the quarter, we're going to really dive in deep, and you'll understand exactly how these different models work. But, one point that's really important, it's true that the breakthrough moment for convolutional neural networks was in 2012 when these networks performed very well on the ImageNet Challenge, but they certainly weren't invented in 2012. These algorithms had actually been around for quite a long time before that.
So one of the sort of foundational works in this area of convolutional neural networks was actually in the '90s from Yann LeCun and collaborators who at that time were at Bell Labs. So in 1998 they built this convolutional neural network for recognizing digits. They wanted to deploy this and wanted to be able to automatically recognize handwritten checks or addresses for the post office. And, they built this convolutional neural network which could take in the pixels of an image and then classify either what digit it was or what letter it was or whatnot. And, the structure of this network actually looks pretty similar to the AlexNet architecture that was used in 2012. Here we see that, you know, we're taking in these raw pixels. We have many layers of convolution and sub-sampling, together with the so-called fully connected layers. All of which will be explained in much more detail later in the course. But, if you just kind of look at these two pictures, they look pretty similar. And, this architecture in 2012 has a lot of these architectural similarities that are shared with this network going back to the '90s. So then the question you might ask is if these algorithms were around since the '90s, why have they only suddenly become popular in the last couple of years? And, there's a couple really key things that have changed since the '90s. One is computation. Thanks to Moore's law, we've gotten faster and faster computers every year. And, this is kind of a coarse measure, but if you just look at the number of transistors that are on chips, then that has grown by several orders of magnitude between the '90s and today. We've also had this advent of graphics processing units or GPUs which are super parallelizable and ended up being a perfect tool for really crunching these computationally intensive convolutional neural network models.
So just by having more compute available, it allowed researchers to explore larger architectures and larger models, and in some cases, just increasing the model size, but still using these kind of classical approaches and classical algorithms, tends to work quite well. So this idea of increasing computation is super important in the history of deep learning. I think the second key innovation that changed between now and the '90s was data. So these algorithms are very hungry for data. You need to feed them a lot of labeled images and labeled pixels for them to eventually work quite well. And, in the '90s there just wasn't that much labeled data available. This was, again, before tools like Mechanical Turk, before the internet was super, super widely used. And, it was very difficult to collect large, varied datasets. But, now in the 2010s with datasets like PASCAL and ImageNet, there existed these relatively large, high quality labeled datasets that were, again, orders and orders of magnitude bigger than the datasets available in the '90s. And, these much larger datasets, again, allowed us to work with higher capacity models and train these models to actually work quite well on real world problems. But, the critical takeaway here is that convolutional neural networks, although they seem like this sort of fancy, new thing that's only popped up in the last couple of years, that's really not the case. And, this class of algorithms has existed for quite a long time in its own right as well. Another thing I'd like to point out: in computer vision we're in the business of trying to build machines that can see like people. And, people can actually do a lot of amazing things with their visual systems. When you go around the world, you do a lot more than just drawing boxes around the objects and classifying things as cats or dogs. Your visual system is much more powerful than that.
And, as we move forward in the field, I think there's still a ton of open challenges and open problems that we need to address. And, we need to continue to develop our algorithms to do even better and tackle even more ambitious problems. Some examples of this are going back to these older ideas in fact. Things like semantic segmentation or perceptual grouping where rather than labeling the entire image, we want to understand for every pixel in the image what is it doing, what does it mean. And, we'll revisit that idea a little bit later in the course. There's definitely work going back to this idea of 3D understanding, of reconstructing the entire world, and that's still an unsolved problem I think. There're just tons and tons of other tasks that you can imagine. For example activity recognition, if I'm given a video of some person doing some activity, what's the best way to recognize that activity? That's quite a challenging problem as well. And, then as we move forward with things like augmented reality and virtual reality, and as new technologies and new types of sensors become available, I think we'll come up with a lot of new, interesting hard and challenging problems to tackle as a field. So this is an example from some of my own work in the vision lab on this dataset called Visual Genome. So here the idea is that we're trying to capture some of these intricacies in the real world. Rather than maybe describing just boxes, maybe we should be describing images as these whole large graphs of semantically related concepts that encompass not just object identities but also object relationships, object attributes, actions that are occurring in the scene, and this type of representation might allow us to capture some of this richness of the visual world that's left on the table when we're using simple classification. 
This is by no means a standard approach at this point, but just kind of giving you this sense that there's so much more that your visual system can do that is maybe not captured in this vanilla image classification setup. I think another really interesting work that kind of points in this direction actually comes from Fei-Fei's grad school days when she was doing her PhD at Caltech with her advisors there. In this setup, they took people and showed them this image for just half a second. So they flashed this image in front of them for just a very short period of time, and even in this very, very rapid exposure to an image, people were able to write these long descriptive paragraphs giving a whole story of the image. And, this is quite remarkable if you think about it that after just half a second of looking at this image, a person was able to say that this is some kind of a game or fight, two groups of men. The man on the left is throwing something. Outdoors, because it seems like I have an impression of grass, and so on and so on. And, you can imagine that if a person were to look even longer at this image, they could write probably a whole novel about who these people are, and why they are in this field playing this game. They could go on and on and on roping in things from their external knowledge and their prior experience. This is in some sense the holy grail of computer vision. To sort of understand the story of an image in a very rich and deep way. And, I think that despite the massive progress in the field that we've had over the past several years, we're still quite a long way from achieving this holy grail. Another image that I think really exemplifies this idea actually comes, again, from Andrej Karpathy's blog. It's this amazing image. Many of you smiled, many of you laughed. I think this is a pretty funny image. But, why is it a funny image?
Well we've got a man standing on a scale, and we know that people are kind of self conscious about their weight sometimes, and scales measure weight. Then we've got this other guy behind him pushing his foot down on the scale, and we know that because of the way scales work that will cause him to have an inflated reading on the scale. But, there's more. We know that this person is not just any person. This is actually Barack Obama who was at the time President of the United States, and we know that Presidents of the United States are supposed to be respectable politicians that are [laughing] probably not supposed to be playing jokes on their compatriots in this way. We know that there's these people in the background that are laughing and smiling, and we know that that means that they're understanding something about the scene. We have some understanding that they know that President Obama is this respectable guy who's looking at this other guy. Like, this is crazy. There's so much going on in this image. And, our computer vision algorithms today are actually a long way I think from this true, deep understanding of images. So I think that sort of despite the massive progress in the field, we really have a long way to go. To me, that's really exciting as a researcher 'cause I think that we'll have just a lot of really exciting, cool problems to tackle moving forward. So I hope at this point I've done a relatively good job to convince you that computer vision is really interesting. It's really exciting. It can be very useful. It can go out and make the world a better place in various ways. Computer vision could be applied in places like medical diagnosis and self-driving cars and robotics and all these different places. In addition to sort of tying back to sort of this core idea of understanding human intelligence. 
So to me, I think that computer vision is this fantastically amazing, interesting field, and I'm really glad that over the course of the quarter, we'll get to really dive in and dig into all these different details about how these algorithms are working these days. That's sort of my pitch about computer vision and about the history of computer vision. I don't know if there's any questions about this at this time. Okay. So then I want to talk a little bit more about the logistics of this class for the rest of the quarter. So you might ask who are we? So this class is taught by Fei-Fei Li who is a professor of computer science here at Stanford who's my advisor and director of the Stanford Vision Lab and also the Stanford AI Lab. The other two instructors are me, Justin Johnson, and Serena Yeung who is up here in the front. We're both PhD students working under Fei-Fei on various computer vision problems. We have an amazing teaching staff this year of 18 TAs so far. Many of whom are sitting over here in the front. These guys are really the unsung heroes behind the scenes making the course run smoothly, making sure everything happens well. So be nice to them. [laughing] I think I also should mention this is the third time we've taught this course, and it's the first time that Andrej Karpathy has not been an instructor in this course. He was a very close friend of mine. He's still alive. He's okay, don't worry. [laughing] But, he graduated, so he's actually here I think hanging around in the lecture hall. A lot of the development and the history of this course is really due to him working on it with me over the last couple of years. So I think you should be aware of that. Also about logistics, probably the best way for keeping in touch with the course staff is through Piazza. You should all go and sign up right now. Piazza is really our preferred method of communication between the class and the teaching staff.
If you have questions that you're afraid of being embarrassed about asking in front of your classmates, go ahead and ask anonymously or even post private questions directly to the teaching staff. So basically anything that you need should ideally go through Piazza. We also have a staff mailing list, but we ask that this is mostly for sort of personal, confidential things that you don't want going on Piazza, or if you have something that's super confidential, super personal, then feel free to directly email me or Fei-Fei or Serena about that. But, for the most part, most of your communication with the staff should be through Piazza. We also have an optional textbook this year. This is by no means required. You can go through the course totally fine without it. Everything will be self-contained. This is sort of exciting because it's maybe the first textbook about deep learning that got published earlier this year by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. I put the Amazon link here in the slides. You can get it if you want to, but also the whole content of the book is free online, so you don't even have to buy it if you don't want to. So again, this is totally optional, but we'll probably be posting some readings throughout the quarter that give you an additional perspective on some of the material. So our philosophy about this class is that you should really understand the deep mechanics of all of these algorithms. You should understand at a very deep level exactly how these algorithms are working, like what exactly is going on when you're stitching together these neural networks, how do these architectural decisions influence how the network is trained and tested and whatnot and all that. And, throughout the course through the assignments, you'll be implementing your own convolutional neural networks from scratch in Python.
You'll be implementing the full forward and backward passes through these things, and by the end, you'll have implemented a whole convolutional neural network totally on your own. I think that's really cool. But, we're also kind of practical, and we know that in most cases people are not writing these things from scratch, so we also want to give you a good introduction to some of the state of the art software tools that are used in practice for these things. So we're going to talk about some of the state of the art software packages like TensorFlow, Torch, PyTorch, all these other things. And, I think you'll get some exposure to those on the homeworks and definitely through the course project as well. Another note about this course is that it's very state of the art. I think it's super exciting. This is a very fast moving field. As you saw, even in these plots of the ImageNet challenge, basically there's been a ton of progress since 2012, and like while I've been in grad school, the whole field is sort of transforming every year. And, that's super exciting and super encouraging. But, what that means is that there's probably content that we'll cover this year that did not exist the last time that this course was taught last year. I think that's super exciting, and that's one of my favorite parts about teaching this course is just roping in all these new scientific, hot off the presses stuff and being able to present it to you guys. We're also sort of about fun. So we're going to talk about some interesting maybe not so serious topics as well this quarter, including image captioning, which is pretty fun, where we can write descriptions about images. But, we'll also cover some of these more artistic things like DeepDream here on the left where we can use neural networks to hallucinate these crazy, psychedelic images. And, by the end of the course, you'll know how that works.
Or on the right, this idea of style transfer where we can take an image and render it in the style of famous artists like Picasso or Van Gogh or whatnot. And again, by the end of the quarter, you'll see how this stuff works. So the way the course works is we're going to have three problem sets. The first problem set will hopefully be out by the end of the week. We'll have an in class, written midterm exam. And, a large portion of your grade will be the final course project where you'll work in teams of one to three and produce some amazing project that will blow everyone's minds. We have a late policy, so you have seven late days that you're free to allocate among your different homeworks. These are meant to cover things like minor illnesses or traveling or conferences or anything like that. If you come to us at the end of the quarter and say that, "I suddenly have to give a presentation at this conference." That's not going to be okay. That's what your late days are for. That being said, if you have some very extenuating circumstances, then do feel free to email the course staff about that. Finally, I want to make a note about the collaboration policy. As Stanford students, you should all be aware of the honor code that governs the way that you should be collaborating and working together, and we take this very seriously. We encourage you to think very carefully about how you're collaborating and making sure it's within the bounds of the honor code. So in terms of prerequisites, I think the most important is probably a deep familiarity with Python because all of the programming assignments will be in Python. Some familiarity with C or C++ would be useful. You will probably not be writing any C or C++ in this course, but as you're browsing through the source code of these various software packages, being able to read C++ code at least is very useful for understanding how these packages work.
We also assume that you know what calculus is, you know how to take derivatives all that sort of stuff. We assume some linear algebra. That you know what matrices are and how to multiply them and stuff like that. We can't be teaching you how to take like derivatives and stuff. We also assume a little bit of knowledge coming in of computer vision maybe at the level of CS131 or 231a. If you have taken those courses before, you'll be fine. If you haven't, I think you'll be okay in this class, but you might have a tiny bit of catching up to do. But, I think you'll probably be okay. Those are not super strict prerequisites. We also assume a little bit of background knowledge about machine learning maybe at the level of CS229. But again, I think really important, key fundamental machine learning concepts we'll reintroduce as they come up and become important. But, that being said, a familiarity with these things will be helpful going forward. So we have a course website. Go check it out. There's a lot of information and links and syllabus and all that. I think that's all that I really want to cover today. And, then later this week on Thursday, we'll really dive into our first learning algorithm and start diving into the details of these things.
Lecture 1 | Introduction to Convolutional Neural Networks for Visual Recognition