Hi everyone, this is Alec, and I'm going to be talking to you today about using recurrent neural networks for text analysis. To understand why this is a potentially useful tool, it helps to look at a bit of the history of how text analysis has been done, particularly from a machine learning perspective. In machine learning we're typically used to vector representations: we know how to deal with numbers, and for categories we use a one-hot encoding. But when we try to understand, classify, or regress on sequences, things become much less clear, because our tools are built around vector approaches. The usual way to deal with this is to compute some hard-coded feature transformation, for instance a TF-IDF vectorizer, perhaps a compression model like LSA, and then to plug a linear model, such as a support vector machine or a softmax classifier, on top of that. The question behind this talk is: what happens if we cut out those techniques and replace them with an RNN?

To understand why this might be an advantage: structure is hard. N-grams are the typical way of preserving some structure. We take our sentence, for instance 'the cat sat on the mat', and re-represent it as the occurrence of every individual word and every combination of adjacent words. Those combinations begin to let us see a little bit of structure; by 'structure' I mean preserving the ordering of words. The problem is that once we allow bigrams or trigrams, the number of possible two- or three-word combinations quickly becomes huge. You can easily end up with 10 million plus features, which is cumbersome, requires lots of memory, and slows things down in and of itself.

Structure, although it's difficult, is also very important. For certain tasks, such as detecting humor or sarcasm, knowing that the word 'cat' or the word 'dog' appeared isn't going to cut it, and that's what a lot of our models today look at. To understand why many models are still based on bags of words and do quite well: n-grams can get you a long way on many tasks. Specific words are often very strong indicators: 'useless' in the case of negative sentiment, 'fantastic' in the case of positive sentiment. If you're trying to classify whether a document is about the stock market or is a recipe, you don't see 'green tea' come up very much in a stock market conversation, and you don't see 'NASDAQ' come up very much in a recipe, so you can separate those kinds of tasks very quickly. It's often a question of knowing what's right for the task at hand. If you're trying to get a more qualitative understanding of what's going on in a body of text, structure may be very important, whereas if you're separating things that are distinctive at the word level, a bag-of-words model can be quite strong.
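To make the n-gram representation concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the same library the talk compares against later); the sentence is the example from above, and the feature-name accessor varies slightly across scikit-learn versions:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat"]

    # ngram_range=(1, 2) keeps every word plus every adjacent word pair.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    vectorizer.fit(docs)

    # Features now include 'the cat', 'cat sat', 'sat on', ... alongside
    # the single words; on a real corpus this list grows into the millions.
    print(vectorizer.get_feature_names_out())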
So how does an RNN work, and what are its potential advantages over a bag-of-words model? An RNN reads through a sequence iteratively, which is really nice because it's how people read as well, and it's able to preserve some of the structure of the sequence. It goes through each word and updates its hidden representation based on that word and on the previous hidden state. At time zero, where we have no previous hidden state, we either feed in a vector of zeros or treat the initial state as another parameter to be learned. The model just continues like this all the way through the sequence. At each time step we have, in this case, a 512-dimensional vector representation of our sequence (we used 512 hidden units). It's a way of taking a sequence of words and, time step by time step, converting it into a fixed-length representation.

As a bit of notation for the diagrams: arrows are projections (dot products) and boxes represent activations, the vectors of values; for instance, the activation of each hidden unit at a time step is one of these boxes. It's important to note that these projections are shared across time: the input-to-hidden projection is shared for all inputs across all time steps, and the hidden-to-hidden connections are shared across the sequence as well. This weight sharing is what makes learning tractable in these models. At the end of iterating through the sequence we have a learned vector representation of the sequence, which can then be used by slapping a traditional classifier on top. In this toy example we've read in an input sentence and we're trying to teach the model to classify the subject of the sentence. You can also stack these models: just as an RNN can go through an input sequence and return its internal representation of it, you can train another RNN on top of that representation, or jointly train both. The structure is quite flexible.

One final note on the original input-to-hidden feed-forward step. The input is typically represented either as a traditional one-hot vector, which doesn't get us much of an advantage, or, and this is what's really exciting, as what's called an 'embedding matrix'. The words 'the cat sat on the mat' are represented as indexes into a matrix: 'the' might be index 100, and when we read through the sequence we look up row 100 and feed the learned representation stored in that row into the RNN as the input. Say we learn 128 dimensions as the input representation for our words and have a 10,000-word vocabulary: that's a matrix with one 128-dimensional entry per word. That's really cool, because it's a learned way to represent our words, and we'll look later in this presentation at what those representations actually look like. They give the model a lot of power.
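As a rough numpy sketch of that lookup (the token indices here are made up for illustration, and the matrix orientation is just a convention):

    import numpy as np

    vocab_size, embedding_dim = 10000, 128

    # One learned row per vocabulary token; random at the start of training.
    embedding_matrix = np.random.randn(vocab_size, embedding_dim) * 0.01

    # 'the cat sat on the mat' after tokenization (hypothetical indices;
    # 'the' is token 100 as in the example above).
    token_ids = [100, 2047, 5310, 42, 100, 731]

    # The 'lookup' is just row indexing: these vectors are what the RNN
    # consumes at each time step.
    inputs = embedding_matrix[token_ids]   # shape (6, 128)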
The big caveat in the literature is that RNNs have a reputation for being very difficult to train: simple RNNs trained with generic stochastic gradient descent are often unstable and hard to learn. What has happened in the research literature over the last few years is that a bunch of tricks have been developed that make them much more stable, much more powerful, and much more reliable. To get an understanding of these, we're going to go quickly through the various tricks.

The first is gating units. To understand what a gating unit is, we first need to look in a bit more detail at how a simple RNN works. We have our hidden state from the previous time step (again, at the first time step this can just be zeros or a learned parameter) and we receive input at time step t. We take the previous hidden state h(t-1) and the input x(t), combine them, for instance via dot-product projections added together, and apply an element-wise activation function such as tanh. That gives the new hidden state. At the next time step we receive more input, combine again, apply another element-wise activation, and the process continues forward.

The problem with this is that the information is rewritten at every time step, so it becomes difficult for information to persist through a model like this. You can think of it as a form of exponential decay: if we have a value of one here, and this process effectively multiplies it by, say, 0.5 at each step, then over the course of several time steps the value decays exponentially toward zero. Information has difficulty spreading through a structure like this.

Various modifications called 'gating units' have been introduced to make this work better. Instead of having the hidden state at a new time step be a direct function of the previous hidden state, a gating unit transforms the information in a more structured way. One of these is the 'gated recurrent unit' (GRU), introduced recently, which uses two types of gates: a reset gate and a dynamics gate. The reset gate takes the previous hidden state and the current input, applies an element-wise sigmoid squash, and computes how much of the previous time step's information should continue along this route. A reset value lies between zero and one, and the previous hidden state is multiplied element-wise by it. This lets the model adaptively forget information. For instance, in sentiment analysis some information might only be relevant on the sentence level: once the model sees a period it can clear some of its state, because it knows the sentence is over, and then reuse that capacity.

Once the previous state has been gated by the reset gate, we compute a candidate new hidden state, h̃(t). Then, rather than just using h̃(t) directly (which would be somewhat similar to the simple model in the previous example), we use the dynamics gate z, which again computes values between zero and one for each unit, to average h̃(t) with the previous hidden state. If z is zero, the model takes the new candidate h̃(t) entirely and none of the previous hidden state, which is equivalent in many ways to the simple recurrent unit: the new hidden state is a completely fresh update (still gated by the reset gate, but fresh). Whereas if z is one, the new hidden state is just a copy of the previous hidden state. So z values near one are a way of propagating information over longer stretches of time.
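Here is a minimal numpy sketch of one GRU step as just described (biases omitted, and following the talk's convention where z near one means 'copy the previous state forward'; some papers swap the roles of z and 1 - z):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
        # Reset gate: how much of the previous state feeds the candidate.
        r = sigmoid(x @ Wr + h_prev @ Ur)
        # Dynamics (update) gate: how much of the previous state to keep.
        z = sigmoid(x @ Wz + h_prev @ Uz)
        # Candidate new state, computed from the reset-gated history.
        h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)
        # z = 1 copies h_prev unchanged; z = 0 takes the candidate entirely.
        return z * h_prev + (1.0 - z) * h_tilde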
You can think of it this way: the easiest way to remember something is just not to change it. A value can persist over very long periods of time if the z values stay near one, because then we don't have all the noise of repeatedly updating the hidden state; we just lock the value in and let it persist. That's a way, for instance, to keep context from a previous sentence around, if the z values are locked on. Expanding the diagram one more time step, just so you see how information flows: there are all these calculations involved in updating the hidden state and the gate values, but at its core the information flows through the upper loop of gated values interacting with previous hidden states.

Gating is essential. That was the theoretical argument for why better-designed gates might help propagate information; empirically it's also very important. For sentiment analysis on longer sequences of text, a paragraph or so, a few hundred words, a simple RNN has difficulty learning at all. You can see it initially moves downhill a little, but all it's actually doing is predicting the average sentiment, around 0.5, whereas a gated recurrent network learns quickly and keeps learning. Simple recurrent units just don't work for these more complex tasks, especially on longer sequences of 100-plus words or tokens, because those models have a hard time keeping information around over long spans of time.

Now that we've talked about gating, the next question is what kind of gating to use. Two types of models have been proposed. There are gated recurrent units, from Cho and colleagues at the University of Montreal, which have been used for machine translation and speech recognition tasks. Then there's the more traditional long short-term memory (LSTM), which has been around much longer and has been used in far more papers, with various modifications to the classic architecture. For text analysis the GRU seems quite nice in general: it's simpler, faster, and optimizes quicker, at least on this sentiment analysis dataset, and because it has only two gates compared to the LSTM's four, it's also a little faster per update. If you have a larger dataset and don't mind waiting longer, an LSTM may be better in the long run, since its additional gates give it extra capacity, but GRUs seem to do quite well on these kinds of problems generally. I tend to favor them myself, but you can try both; the library we'll introduce later in this talk supports both.

The next issue is exploding gradients. Exploding gradients are a training-dynamics phenomenon in recurrent neural networks where the gradient values we use at each step of the training algorithm can become very large and very unstable. This is one source of the reputation RNNs have for being hard to train. Typically you would see small values, the norm of your gradient bouncing around one, and then occasionally huge spikes. Those spikes can be quite damaging, because a standard learning update would then drastically change your parameters, which can result in unstable oscillations where your whole model explodes. In 2012 there was a great paper that proposed simply clipping the norm of the gradient: if the norm exceeds a set value, for instance 15, the gradient is rescaled down to that value. This became a common way of making RNNs much more stable.
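A sketch of that clipping rule, assuming the gradients arrive as a list of numpy arrays:

    import numpy as np

    def clip_global_norm(grads, threshold=15.0):
        # Combined L2 norm across all parameter gradients.
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > threshold:
            # Rescale so the norm equals the threshold; direction is preserved.
            grads = [g * (threshold / total_norm) for g in grads]
        return grads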
Interestingly though, at least in text analysis for sentiment, we don't seem to hit this problem with modern optimizers: the gradient decays pretty cleanly and stays quite stable over the course of learning.

Another way to make recurrent neural networks better is to use better gating functions. There was an interesting paper at NIPS this year whose basic idea was: let's make our gates steeper, so they switch more rapidly between a value of zero and a value of one. A traditional sigmoid changes fairly smoothly between about minus five and five, so when you randomly initialize the parameters at the beginning of training, the gate values tend to sit near the average, 0.5, and you don't see much gating dynamics. If we make the gates steeper, they begin to switch between zero and one much more readily, particularly near the beginning of learning, and models with these steeper gating units seem to learn a bit faster because they learn how to use their gates sooner. This is another quick, easy technique, and the library we'll introduce later in this talk supports it.

Another technique is orthogonal initialization. When we begin training these models we don't know what values to use for the parameters, the weight matrices in those dot products, so the research literature typically initializes them with random Gaussian or random uniform noise. Andrew Saxe did some great work last year showing that using random orthogonal matrices works much better, which is in line with earlier work noting that similar forms of initialization work well for RNNs.

Now we want to understand how we train these models, and there are a variety of techniques that can be used. This is a visualization of the training dynamics of various algorithms on a toy dataset where we're trying to classify red dots from blue dots. We only have a linear model, so all it can do is learn a line separating the two; it can't do it perfectly, because there will always be points on the wrong side. The most basic optimizer is stochastic gradient descent (SGD), and then there are these various improvements on it. The main point of this example is: effectively, don't use plain SGD. Very early in training SGD can look quite similar to the others, but once the norm of your gradients becomes smaller in the later stages of optimization you want some dynamicism in your learning algorithm, and SGD, once it gets out of the very steep early regions of learning, tends to slow down. This is a particular problem in text analysis because updates on words are very sparse: there are rare words you only see once every thousand or hundred thousand tokens, and those words are very difficult to learn in a traditional SGD framework.
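A common recipe for the orthogonal initialization mentioned above, sketched in numpy (one standard construction, not necessarily the paper's exact code):

    import numpy as np

    def orthogonal_init(shape, scale=1.0):
        # `shape` is a tuple, e.g. (512, 512). SVD of Gaussian noise yields
        # orthonormal factors; pick the one matching the requested shape.
        a = np.random.randn(*shape)
        u, _, vt = np.linalg.svd(a, full_matrices=False)
        q = u if u.shape == shape else vt
        return scale * q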
Techniques like momentum and Nesterov's accelerated gradient effectively average together multiple updates and accumulate those averages; they're a form of smoothing out the stochastic noise and accelerating along directions of consistent update. There's another family of acceleration methods, the 'Ada' family, that scale the learning rate, the amount by which we update a parameter given a gradient, using heuristics describing the gradient history. In the case of Adagrad, we accumulate the norms of the gradients seen so far for each parameter and scale the learning rate by that: you can see it learns quite quickly early on and then slows down as it approaches, in this case, the optimum. Adadelta and RMSprop do something similar but make it dynamic, basing the scaling on the local history rather than the global history of the gradients for a parameter. A recently introduced optimizer called Adam combines the early optimization speed we saw with Adagrad with the better late-stage convergence of methods like Adadelta and RMSprop, and it looks quite good for text analysis with RNNs. We can see that Adam gets off to a very fast learning start, just like Adagrad. (Full disclosure: there's a slight bug in my code for these results, so take them with a grain of salt; they still look good.) That bug might actually explain why we saw slightly worse generalization with Adam: it would train quite well, but its performance on held-out data wasn't always as good, perhaps because it learned so much more quickly. We're still looking into why that happens, but in general, modern optimizers are essential on these kinds of problems.

That's the background on the various techniques for making RNNs train more efficiently, and together they add up to a lot. Early in learning, a standard gated RNN with plain SGD is the baseline (a simple RNN on this plot would look almost flat). Adding gradient clipping makes training more stable, so we can use a slightly larger learning rate, and it begins to learn faster. Adding orthogonal initialization makes it learn faster and better again. Finally, adding Adam gives another huge gain over traditional SGD. These improvements stack: with all of them together we reach lower effective minima, and get there up to 10x faster per update. Admittedly these techniques add a little computation time, so it might only be, for instance, 7.5x faster in wall-clock terms.

This matters because RNNs can now actually overfit quite a lot: as they continue to fit the training data, their test performance might plateau. We keep improving on the training set while the model is really just optimizing for details of the training data that aren't true of new data; that's 'overfitting'. One technique used to combat this is 'early stopping': on each iteration through the dataset we record the training and validation scores of the model, and we stop once we notice that validation performance has stopped improving.
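The early-stopping recipe looks roughly like this; fit_one_epoch and score are hypothetical stand-ins for whatever your training loop actually exposes:

    def train_with_early_stopping(model, train_data, valid_data, max_epochs=100):
        best = float("-inf")
        for epoch in range(max_epochs):
            model.fit_one_epoch(train_data)   # one full pass over the training set
            score = model.score(valid_data)   # e.g. held-out accuracy
            if score <= best:                 # validation stopped improving
                break
            best = score
        return model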
Oftentimes this happens on your first or second iteration through the dataset when all these techniques are combined. That's good news, because models in this space can otherwise take ten, fifty, or a hundred iterations through the training data to converge; with RNNs we often start to overfit after one or two epochs.

To get a better sense of how these models can do, we're going to compare them against a much more standard technique from the literature. We'll use the fantastic machine learning library scikit-learn and a standard linear-model approach, the traditional approach to text analysis: a TF-IDF vectorizer with a linear model such as logistic regression. This is by no means meant to be the best possible model; in many cases Naive Bayes SVM, for instance, is actually better than logistic regression for classification. But it's very accessible and very easy to compare against. To be fair to it, we're going to use bigrams, which get a little structure into the features: this way the model can see 'not good' rather than just the separate tokens 'not' and 'good', which might be useful in sentiment analysis. We're also going to use grid search to evaluate potential hyperparameters for the linear model. We'll look at two: minimum document frequency, which controls the size of the input to the linear model by ignoring tokens that appear in fewer than some number of documents (if the word 'dinosaur' appears only once in our dataset, we ignore it), and the regularization coefficient, which is a way of preventing overfitting in linear models. Because we grid search over potential values for both, we're not explaining away a performance difference by poorly tuned parameters. And because these linear models are fast, we can search over candidate values much more effectively; that's a fair advantage to give them.
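A sketch of that baseline with scikit-learn; the grid values here are illustrative, and newer scikit-learn versions keep GridSearchCV in sklearn.model_selection:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # unigrams + bigrams
        ("clf", LogisticRegression()),
    ])

    # The two knobs discussed above: minimum document frequency, and the
    # regularization strength (C is the inverse regularization coefficient).
    param_grid = {
        "tfidf__min_df": [1, 5, 10],
        "clf__C": [0.1, 1.0, 10.0],
    }

    search = GridSearchCV(pipeline, param_grid)
    # search.fit(train_text, train_labels)   # train_text: a list of strings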
Our second model is one of these recurrent neural networks. Admittedly, this is our own research, so take every result with a grain of salt; I'm using whatever I've tried that worked. The general message, though, is that a modern optimizer such as Adam, a gated recurrent unit, steeper sigmoid gates, and orthogonal initialization are good defaults. A medium-sized model that works quite well is a 256-dimensional embedding with a 512-dimensional hidden representation. Then we put on whatever output we need: logistic regression for binary sentiment classification, linear regression for predicting real values, and so on. It's quite flexible, because the RNN at its core is a way of taking a sequence of values and converting it into a vector; once we have that vector we can put whatever traditional model we want on top of it, so long as it's differentiable and open to gradient-based training.

How does this work on real datasets? What we see quickly is that the linear model does incredibly well on smaller datasets: with only 1,000 or 10,000 training examples, the linear model can outperform the RNN by something like 50%. But what's interesting is that as our datasets get bigger, the RNN tends to scale better at larger dataset sizes. Because the RNN is admittedly a much more complex model and operates on the sequences themselves, with more training data it can ideally learn a much better way to do the task at hand, whereas the linear model, because it operates on an unstructured bag of words and is just a linear model, may eventually hit a wall where it can't do any better. You can imagine situations where you just aren't going to be able to classify the sentiment of a text from word counts, when it uses double negation, for instance.

That's one example, with sentiment analysis. What's also interesting is that we see the same pattern when predicting the helpfulness of a customer review. This is notable because helpfulness is an even more qualitative, abstract concept than sentiment. Again, with small amounts of data the linear model, in this case a linear regression since we're predicting real values, does much better, but it doesn't seem to scale to more data as effectively as the RNN. So RNNs seem to have poor generalization properties with small amounts of data, but they do better with large amounts: at one million labeled examples they are often between zero and 30% better than the equivalent linear model. These are just examples with logistic regression and linear regression, but the crossover seems robust, somewhere between 100,000 and a million examples, though it depends on the dataset.

There's one unfortunate caveat to this approach: it's quite slow. For a million paragraph-sized text examples, the linear model converges in about 30 minutes on a single CPU core. The RNN on a high-powered graphics card such as a GTX 980 takes about two hours, which isn't too bad: on a proper high-end GPU the RNN is only about four times slower to converge at a million examples than the linear model on a basic CPU core. But if we train the RNN on that same CPU core, it takes five days, roughly 250 times slower than the linear model, and that's just not going to cut it. This is effectively why we use GPUs in this research.

Now here's the cool part of the presentation. Again, an RNN being fed an input sequence effectively learns a representation for each word: each word is replaced, via its identifier ('the' is token 100), with a vector representation learned by our model. The visualizations we'll show you are what you get when you look at the representations those models learn. We use an algorithm called t-SNE to visualize the embeddings our RNN learns. To make it a little clearer, these are representations learned from training only on binary sentiment analysis, predicting whether a given customer review likes a product or doesn't. We've projected the representations into two dimensions with t-SNE and colored each word by the average sentiment of the reviews it appears in.
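A sketch of that visualization; the embedding matrix and the per-word average sentiment are random stand-ins here for quantities you'd pull from a trained model and its corpus:

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    vocab_size = 2000
    embedding_matrix = np.random.randn(vocab_size, 128)  # learned by the RNN in practice
    avg_sentiment = np.random.rand(vocab_size)           # computed from reviews in practice

    # Project the 128-dimensional word vectors down to two dimensions.
    coords = TSNE(n_components=2).fit_transform(embedding_matrix)

    plt.scatter(coords[:, 0], coords[:, 1], c=avg_sentiment, cmap="coolwarm", s=4)
    plt.colorbar(label="average sentiment of containing reviews")
    plt.show()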
What we see is a kind of axis. It doesn't correspond to any actual aligned axis, because this is t-SNE, but there's a continuum between very negative words and very positive words. That isn't too surprising: a model trained on sentiment analysis learns to separate negative and positive words; that's what you'd expect to happen. Looking into the very positive and very negative clusters, we see very understandable groupings: 'useless', 'waste', 'poorly', 'disappointed' as negative. You also see interesting structure, because the visualization tries to place similar things close together: within the very negative grouping it has identified 'returned', 'returning', 'returns', 'return' all together. It seems to know that return-related words are very negative, unsurprisingly, if you find them in a review, but it has also separated them slightly from other, more generic negative words. On the positive side we see equally unsurprising indicators of happy sentiment: 'fantastic', 'wonderful', 'pleased'.

What's even more interesting about this model is that we see other kinds of grouping and structure being learned. It pulls out quantities of time: weeks, months, hours, minutes. It pulls out qualifiers like 'really', 'absolutely', 'extremely', 'totally'. Qualifiers are interesting because by themselves they're neutral; they don't indicate positive or negative sentiment, they modify it: you can have 'extremely good' and 'extremely bad'. Yet you see them pulled out together. You also see product nouns, things that products can be, like movies, books, stories, items, and devices, grouped together. Additionally, punctuation is grouped together. This potentially indicates our model is learning to use these kinds of tokens, which in turn implies it may be learning to use some of the structure present in the data. By grouping punctuation together and learning similar representations for it, the model implies it's finding some use for it, and we'd expect punctuation to be quite useful for segmenting and separating meanings and notions. Quantities of time are interesting too: they're slightly negatively associated, which is understandable when you consider phrases like 'this product took months to show up' or 'it worked for a total of an hour'. Grouping them all together implies some use for them, and the same goes for the qualifiers. We have no direct evidence, at least in this picture, of how these words are being used, but learning similar representations for them and grouping them together implies the model is finding a use for them, and we can extrapolate that it may in fact be learning to use these words in natural ways for sentiment analysis.

Remember, all of this is learned purely from zero/one binary indicator labels. It's a bit like seeing sequences of token numbers, 1,000, 2,000, 3,046, 5, and then realizing that tokens 5 and 1,000 are the exclamation point and the period, and that they're similar to tokens 2,000 and 7,000, which are the comma and the colon. It's a very strong result, and very interesting to see this kind of similarity being learned by our model.
This is cool, but how can we actually use these models? We're also presenting today a basic library that lets developers use these recurrent neural networks for text analysis. It's called Passage, and it's a tiny library built on top of Theano, the great machine learning framework and math library. It's incredibly alpha, we're still working on it, but it has a variety of features. Passage is clonable via GitHub, and we're going to walk through a quick example of using it to analyze text.

First we import the components we need. One is the tokenizer, which takes strings of text and separates them into individual tokens, words and punctuation, for instance. A tokenizer can just be instantiated; it has a variety of parameters but sensible defaults. We emulate a scikit-learn style interface: we call fit_transform on a body of training text, a list of strings, and it returns a list of training tokens that Passage uses natively to train RNN models. Next we import the various layers of the model: the embedding matrix we talked about, the gated recurrent unit, and a dense output classifier. We compose these into a trainable model by stacking them together in a list. Our input is one of these embedding matrices, which we set to 128 dimensions; it also needs to know how many token types there are, and we just pull that out of the tokenizer. Then we use a gated recurrent layer, in this case also of size 128. These sizes are smaller than you would use for real models, and you can see better performance from larger models, but these are small enough to run on a CPU without taking forever (they'll still take quite a while). Finally, we have our dense output unit: for binary sentiment classification, detecting whether a string of text is negative or positive, that's one unit, since we're predicting a single value, with a sigmoid activation as a way of separating negative from positive.

To build the model we instantiate the model class, RNN, importable from passage.models, give it the layers we want to build our custom architecture out of, and tell it what cost function to optimize. The cost function is what lets us train the model; it tells the model how well it did on each example. For binary classification we use binary cross-entropy in this example. To train the model we just call the fit interface, which takes the training tokens made from the training text along with the training labels we want to predict. Note that this performs only one iteration through your dataset: as mentioned earlier, you may want to train for multiple iterations if your model hasn't converged, and you may want to measure performance on held-out data to know when to stop training as it begins to overfit. Right now we've left that part to you, but we'll be extending the library with interfaces that do this automatically.
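Put together, the walkthrough looks roughly like this (reconstructed from the description above; Passage is alpha, so exact names may drift between versions):

    from passage.preprocessing import Tokenizer
    from passage.layers import Embedding, GatedRecurrent, Dense
    from passage.models import RNN

    # Toy data; in practice these are your corpus and labels.
    train_text = ["this is terrible", "absolutely fantastic"]
    train_labels = [0, 1]

    tokenizer = Tokenizer()
    train_tokens = tokenizer.fit_transform(train_text)

    layers = [
        Embedding(size=128, n_features=tokenizer.n_features),
        GatedRecurrent(size=128),
        Dense(size=1, activation='sigmoid'),
    ]

    model = RNN(layers=layers, cost='bce')  # 'bce' = binary cross-entropy
    model.fit(train_tokens, train_labels)   # one pass through the dataset

    preds = model.predict(tokenizer.transform(["pretty great product"]))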
Finally, if you want the model to predict on new data, you just call model.predict on tokenizer.transform of your test text, as in the last line of the sketch above, and it returns the model's predictions. That's an example of how to use Passage.

To summarize: RNNs are now a potentially competitive tool in certain situations for text analysis. Admittedly there are a lot of disclaimers there, but a general trend seems to be emerging: if you have a large dataset, a million-plus examples, and you have a GPU, they can look quite good; they can potentially outperform linear models and might not take all that much longer. But if you have a smaller dataset and don't have a GPU, they can be very difficult to justify, however cool these models might seem, compared to linear models: they're a lot slower, they have a lot of complexity, with a lot of different parts and a lot of different architectures you can change, and they seem to generalize poorly on small datasets.

Thanks for listening. If you have any questions, you can reach me at alec@indico.io. Also, if you'd like a more general introduction to machine learning and deep learning in Python, I have another video you can check out in the upper right, introducing, as a Python developer, how to use the awesome Theano library to implement these algorithms yourself. Additionally, if you'd like to learn more about indico, feel free to visit our website at indico.io, where we have various tools like Passage available for developers to use for machine learning. Thanks.