{"font_size":0.4,"font_color":"#FFFFFF","background_alpha":0.5,"background_color":"#9C27B0","Stroke":"none","body":[{"from":5.21,"to":7.95,"location":2,"content":"Hi, everyone. I'm Abby,"},{"from":7.95,"to":9.54,"location":2,"content":"I'm the head TA for this class"},{"from":9.54,"to":12.51,"location":2,"content":"and I'm also a PhD student in the Stanford NLP group."},{"from":12.51,"to":14.67,"location":2,"content":"And today I'm gonna be telling you about"},{"from":14.67,"to":17.04,"location":2,"content":"language models and recurrent neural networks."},{"from":17.04,"to":19.98,"location":2,"content":"So, here's an overview of what we're gonna do today."},{"from":19.98,"to":24.89,"location":2,"content":"Today, first, we're going to introduce a new NLP task, that's language modelling,"},{"from":24.89,"to":29.5,"location":2,"content":"and that's going to motivate us to learn about a new family of neural networks,"},{"from":29.5,"to":32.73,"location":2,"content":"that is recurrent neural networks or RNNs."},{"from":32.73,"to":34.44,"location":2,"content":"So, I'd say that these are two of"},{"from":34.44,"to":37.41,"location":2,"content":"the most important ideas you're going to learn for the rest of the course."},{"from":37.41,"to":41.07,"location":2,"content":"So, we're going to be covering some fairly cool material today."},{"from":41.07,"to":44.3,"location":2,"content":"So, let's start off with language modeling."},{"from":44.3,"to":48.77,"location":2,"content":"Language modeling is the task of predicting what word comes next."},{"from":48.77,"to":52.23,"location":2,"content":"So, given this piece of text the students opens their blank,"},{"from":52.23,"to":58.37,"location":2,"content":"could anyone shout out a word which you think might be coming next?"},{"from":58.37,"to":59.55,"location":2,"content":"Purpose. [NOISE]."},{"from":59.55,"to":63.84,"location":2,"content":"[OVERLAPPING] Mind, what else? 
I didn't quite hear them,"},{"from":63.84,"to":66.72,"location":2,"content":"but, uh, yeah, these are all likely things, right?"},{"from":66.72,"to":68.15,"location":2,"content":"So, these are some things which I thought,"},{"from":68.15,"to":69.49,"location":2,"content":"students might be opening, uh,"},{"from":69.49,"to":71.2,"location":2,"content":"students open their books, seems likely."},{"from":71.2,"to":73.28,"location":2,"content":"Uh, students open their laptops,"},{"from":73.28,"to":75.04,"location":2,"content":"students open their exams,"},{"from":75.04,"to":76.58,"location":2,"content":"Students open their minds, incredibly,"},{"from":76.58,"to":78.7,"location":2,"content":"someone came up with one, that one just now,"},{"from":78.7,"to":80.33,"location":2,"content":"uh, it's kind of a metaphorical meaning of opening."},{"from":80.33,"to":83.67,"location":2,"content":"So, you are all performing language modeling right now."},{"from":83.67,"to":85.47,"location":2,"content":"And thinking about what word comes next,"},{"from":85.47,"to":87.25,"location":2,"content":"you are being a language model."},{"from":87.25,"to":91.39,"location":2,"content":"So, here's a more formal definition of what a language model is."},{"from":91.39,"to":94.43,"location":2,"content":"Given a sequence of words X1 up to Xt,"},{"from":94.43,"to":97.34,"location":2,"content":"a language model, is something that computes"},{"from":97.34,"to":101.2,"location":2,"content":"the probability distribution of the next word, Xt plus 1."},{"from":101.2,"to":104.19,"location":2,"content":"So, a language model comes up with the probability distribution,"},{"from":104.19,"to":109.07,"location":2,"content":"the conditional probability, of what X t plus 1 is given the words it found."},{"from":109.07,"to":110.81,"location":2,"content":"And here we're assuming that, Xt plus 1"},{"from":110.81,"to":113.96,"location":2,"content":"can be any word w from a fixed vocabulary V."},{"from":113.96,"to":115.52,"location":2,"content":"So we are assuming that there is"},{"from":115.52,"to":118.2,"location":2,"content":"a pre-defined list of words that we're considering."},{"from":118.2,"to":120.14,"location":2,"content":"In this way, you can view language modeling"},{"from":120.14,"to":121.85,"location":2,"content":"as a type of classification task,"},{"from":121.85,"to":124.58,"location":2,"content":"because there's a predefined number of possibilities."},{"from":124.58,"to":129.85,"location":2,"content":"Um, we call a system that does this a language model."},{"from":129.85,"to":132.05,"location":2,"content":"There's an alternative way of thinking"},{"from":132.05,"to":133.72,"location":2,"content":"about a language model as well."},{"from":133.72,"to":135.2,"location":2,"content":"You can think of a language model"},{"from":135.2,"to":139.06,"location":2,"content":"as a system which assigns probability to a piece of text."},{"from":139.06,"to":141.47,"location":2,"content":"So, for example, if we have some piece of text,"},{"from":141.47,"to":143.18,"location":2,"content":"X up to X capital T,"},{"from":143.18,"to":145.04,"location":2,"content":"then, the probability of this text"},{"from":145.04,"to":147.83,"location":2,"content":"according to the language model can be broken down."},{"from":147.83,"to":149.25,"location":2,"content":"So, just by definition,"},{"from":149.25,"to":151.25,"location":2,"content":"you can say that the probability is equal to,"},{"from":151.25,"to":154.53,"location":2,"content":"the product of all of 
these conditional probabilities."},{"from":154.53,"to":157.38,"location":2,"content":"And, uh, the form inside,"},{"from":157.38,"to":160.48,"location":2,"content":"the products is exactly what a language model provides."},{"from":160.48,"to":162.47,"location":2,"content":"So, you can think of these things as somewhat equivalent."},{"from":162.47,"to":164.72,"location":2,"content":"Predicting next words, gives you a system,"},{"from":164.72,"to":169.27,"location":2,"content":"that can give the probability of a given piece of text."},{"from":169.27,"to":172.75,"location":2,"content":"So, in fact, you, use language models every day."},{"from":172.75,"to":176.12,"location":2,"content":"For example, when you're texting on your phone and you're writing a message,"},{"from":176.12,"to":177.61,"location":2,"content":"then most likely if you have a smartphone,"},{"from":177.61,"to":180.08,"location":2,"content":"it will be predicting what word you might be about to say."},{"from":180.08,"to":181.79,"location":2,"content":"So, if you say, um, I'll meet you at the-"},{"from":181.79,"to":184.25,"location":2,"content":"your phone might suggest perhaps you mean airport or cafe,"},{"from":184.25,"to":186.03,"location":2,"content":"or office, for example."},{"from":186.03,"to":188.91,"location":2,"content":"Another situation which you use language models every day"},{"from":188.91,"to":191.5,"location":2,"content":"is when you search for something on the internet, for example, Google,"},{"from":191.5,"to":192.83,"location":2,"content":"and you start typing your query,"},{"from":192.83,"to":195.95,"location":2,"content":"then Google tries to complete your query for you, and that's language modeling."},{"from":195.95,"to":200.48,"location":2,"content":"It's predicting what word or words might come next."},{"from":200.48,"to":203.72,"location":2,"content":"So, that's what a language model is,"},{"from":203.72,"to":206.68,"location":2,"content":"and the question is, how would you learn a language model?"},{"from":206.68,"to":209.92,"location":2,"content":"So, if I was to ask that question in the pre- deep learning era,"},{"from":209.92,"to":211.73,"location":2,"content":"which was really only a few years ago,"},{"from":211.73,"to":215,"location":2,"content":"the answer would be, you would learn a n-gram language model."},{"from":215,"to":218.57,"location":2,"content":"So, today first we're going to learn about n-gram language models."},{"from":218.57,"to":221.33,"location":2,"content":"So, before I can tell you what a n-gram language model is,"},{"from":221.33,"to":223.16,"location":2,"content":"you need to know what an n-gram is."},{"from":223.16,"to":227.91,"location":2,"content":"So, by definition an n-gram is a chunk of n consecutive words."},{"from":227.91,"to":230.4,"location":2,"content":"So, for example, a one gram or unigram,"},{"from":230.4,"to":232.05,"location":2,"content":"is just all of the individual words"},{"from":232.05,"to":235.02,"location":2,"content":"in the sequence that would be \"the students open the-\""},{"from":235.02,"to":238.81,"location":2,"content":"A two gram or bigram would be all of the consecutive chunks of pairs of words,"},{"from":238.81,"to":240.98,"location":2,"content":"\"the students\", \"students opened\", \"opened their\""},{"from":240.98,"to":245.05,"location":2,"content":"and so on for trigrams and four-grams, etc."},{"from":245.05,"to":248.57,"location":2,"content":"So, the core idea of an n-gram language 
model"},{"from":248.57,"to":251.03,"location":2,"content":"is that in order to predict what word comes next,"},{"from":251.03,"to":252.81,"location":2,"content":"you're going to collect a bunch of statistics,"},{"from":252.81,"to":254.93,"location":2,"content":"about how frequent different n-grams are,"},{"from":254.93,"to":256.49,"location":2,"content":"from some kind of training data,"},{"from":256.49,"to":258.11,"location":2,"content":"and then you can use those statistics"},{"from":258.11,"to":261.83,"location":2,"content":"to predict what next words might be likely."},{"from":261.83,"to":263.64,"location":2,"content":"Here is some more detail."},{"from":263.64,"to":266.32,"location":2,"content":"So, to make an n-gram language model,"},{"from":266.32,"to":268.49,"location":2,"content":"first you need to make a simplifying assumption,"},{"from":268.49,"to":270.31,"location":2,"content":"and this your assumption."},{"from":270.31,"to":273.35,"location":2,"content":"You say that the next word Xt plus 1"},{"from":273.35,"to":277.54,"location":2,"content":"depends only on the preceding N-1 words."},{"from":277.54,"to":279.9,"location":2,"content":"So, what we're assuming,"},{"from":279.9,"to":281.65,"location":2,"content":"is that the probability distribution,"},{"from":281.65,"to":285.02,"location":2,"content":"the conditional probability of Xt plus 1 given all of the words they follow,"},{"from":285.02,"to":286.16,"location":2,"content":"we're just going to simplify that,"},{"from":286.16,"to":290.49,"location":2,"content":"and say it only depends on the last N-1 words, and that's our assumption."},{"from":290.49,"to":293.95,"location":2,"content":"So, by the definition of conditional probability,"},{"from":293.95,"to":295.6,"location":2,"content":"we can say that this probability,"},{"from":295.6,"to":298.38,"location":2,"content":"is just the ratio of two different probabilities."},{"from":298.38,"to":301.18,"location":2,"content":"So, on the top, you've got the probability of"},{"from":301.18,"to":303.22,"location":2,"content":"a particular n-gram and on the bottom we've"},{"from":303.22,"to":306.19,"location":2,"content":"got the probability of a particular N-1 gram"},{"from":306.19,"to":308.02,"location":2,"content":"This is a little hard to read because of all the superscripts"},{"from":308.02,"to":311.01,"location":2,"content":"but I'm gonna give an example with words on the next slide."},{"from":311.01,"to":315.06,"location":2,"content":"Okay. So, that's the definition of the probability of the next word,"},{"from":315.06,"to":317.14,"location":2,"content":"but the question remains, how do we get all of"},{"from":317.14,"to":319.98,"location":2,"content":"these n-gram and N-1 gram probabilities?"},{"from":319.98,"to":322.3,"location":2,"content":"So, the answer is, we're going to get them by"},{"from":322.3,"to":325.05,"location":2,"content":"counting them in some large corpus of text."},{"from":325.05,"to":326.51,"location":2,"content":"So, we're going to approximate,"},{"from":326.51,"to":329.56,"location":2,"content":"these probabilities just by the count of the number of times that"},{"from":329.56,"to":334.41,"location":2,"content":"these particular n-grams and N-1 grams appeared in our training corpus."},{"from":334.41,"to":337.37,"location":2,"content":"Okay. 
So, here's an example with some words."},{"from":337.37,"to":340.56,"location":2,"content":"Suppose we are trying to learn a 4-gram language model,"},{"from":340.56,"to":342.83,"location":2,"content":"and suppose that we have a piece of text, that says,"},{"from":342.83,"to":344.54,"location":2,"content":"\"As the proctor started the clock,"},{"from":344.54,"to":346.1,"location":2,"content":"the students opened their blank\","},{"from":346.1,"to":348.89,"location":2,"content":"and we're trying to predict what word is coming next."},{"from":348.89,"to":351.74,"location":2,"content":"So, because we're learning a 4-gram language model,"},{"from":351.74,"to":355.91,"location":2,"content":"a simplifying assumption is that the next word depends only on the last three words,"},{"from":355.91,"to":357.61,"location":2,"content":"last N-1 words."},{"from":357.61,"to":361.52,"location":2,"content":"So, we're going to discard all of the context so far except for the last few words,"},{"from":361.52,"to":363.8,"location":2,"content":"which is, \"Students opened their.\""},{"from":363.8,"to":367.62,"location":2,"content":"So, as a reminder, n-gram language model says that,"},{"from":367.62,"to":368.94,"location":2,"content":"the probability of the next word being,"},{"from":368.94,"to":373.23,"location":2,"content":"some particular word W in the vocabulary is equal to the number of times we saw"},{"from":373.23,"to":375.51,"location":2,"content":"students opened their W divided by the number of"},{"from":375.51,"to":378.65,"location":2,"content":"times we saw students opened their, in the training corpus."},{"from":378.65,"to":381.44,"location":2,"content":"So, let's suppose that in our training corpus,"},{"from":381.44,"to":384.21,"location":2,"content":"we saw the phrase \"students open their\" 1,000 times."},{"from":384.21,"to":388.34,"location":2,"content":"And suppose that, we saw \"students opened their books\" 400 times."},{"from":388.34,"to":392.22,"location":2,"content":"This means that the probability of the next word being books is 0.4."},{"from":392.22,"to":396.81,"location":2,"content":"And uh, similarly, let's suppose that we saw students open their exams 100 times,"},{"from":396.81,"to":399.26,"location":2,"content":"this means that the probability of exams given students"},{"from":399.26,"to":401.93,"location":2,"content":"open their is 0.1. Is there a question?"},{"from":401.93,"to":404.9,"location":2,"content":"[inaudible]."},{"from":404.9,"to":407.01,"location":2,"content":"The question is, does the order of the words matter?"},{"from":407.01,"to":410.34,"location":2,"content":"And the answer is yes, the order of students open there does matter."},{"from":410.34,"to":413.19,"location":2,"content":"It's different to \"the students opened.\""},{"from":413.19,"to":416.99,"location":2,"content":"So, the question I want to raise now is,"},{"from":416.99,"to":420.81,"location":2,"content":"was it a good idea for us to discard the proctor context?"},{"from":420.81,"to":423.12,"location":2,"content":"If you look at the actual example that we had,"},{"from":423.12,"to":426.07,"location":2,"content":"the example was as the proctor started the clock,"},{"from":426.07,"to":427.85,"location":2,"content":"the students opened their blank."},{"from":427.85,"to":432.36,"location":2,"content":"So, do we think that books or exams is more likely given the actual context,"},{"from":432.36,"to":434.55,"location":2,"content":"the full context? 
Yep."},{"from":434.55,"to":435.45,"location":2,"content":"Exams."},{"from":435.45,"to":437.8,"location":2,"content":"Right. Exams is more likely because the proctor and"},{"from":437.8,"to":440.26,"location":2,"content":"the clock heavily implies that it's an exam scenario, so"},{"from":440.26,"to":442.63,"location":2,"content":"they're more likely to be opening the exams than the books,"},{"from":442.63,"to":444.4,"location":2,"content":"unless it's an open book exam."},{"from":444.4,"to":446.83,"location":2,"content":"Uh, but I think, overall, it should be exams."},{"from":446.83,"to":449.89,"location":2,"content":"So, the problem that we're seeing here is that in the training corpus,"},{"from":449.89,"to":451.24,"location":2,"content":"the fact that students were opening"},{"from":451.24,"to":453.99,"location":2,"content":"something means that it's more likely to be books than exams"},{"from":453.99,"to":456.31,"location":2,"content":"because overall, books are more common than exams."},{"from":456.31,"to":458.56,"location":2,"content":"But if we know that the context is,"},{"from":458.56,"to":461.08,"location":2,"content":"the proctor and the clock, then it should be exams."},{"from":461.08,"to":464.24,"location":2,"content":"So, what I'm highlighting here is a problem with our simplifying assumption."},{"from":464.24,"to":465.86,"location":2,"content":"If we throw away too much context,"},{"from":465.86,"to":470.45,"location":2,"content":"then we are not as good as predicting the words as we would be if we kept the context."},{"from":470.45,"to":474.69,"location":2,"content":"Okay. So, that's one problem with n-gram, uh, language models."},{"from":474.69,"to":476.81,"location":2,"content":"Uh, there are some other problems as well."},{"from":476.81,"to":480.47,"location":2,"content":"So, uh, here again is the equation that you saw before."},{"from":480.47,"to":481.88,"location":2,"content":"One problem which we're gonna call"},{"from":481.88,"to":485.46,"location":2,"content":"the sparsity problem is what happens if the number on top,"},{"from":485.46,"to":488.38,"location":2,"content":"the numerator, what if that count is equal to zero."},{"from":488.38,"to":491.21,"location":2,"content":"So, what if for some particular word W,"},{"from":491.21,"to":494.45,"location":2,"content":"the phrase students opened their W never occurred in the data."},{"from":494.45,"to":497.24,"location":2,"content":"So, for example, let's suppose students opened their petri dishes,"},{"from":497.24,"to":499.88,"location":2,"content":"is fairly uncommon and it never appears in the data,"},{"from":499.88,"to":504.36,"location":2,"content":"then that means our probability of the next word being petri dishes will be zero."},{"from":504.36,"to":507.39,"location":2,"content":"And this is bad, because it might be uncommon but it is,"},{"from":507.39,"to":509.38,"location":2,"content":"a valid scenario, right?"},{"from":509.38,"to":511.09,"location":2,"content":"If you're a biology student for example."},{"from":511.09,"to":514.09,"location":2,"content":"So, this is a problem and we call it the sparsity problem,"},{"from":514.09,"to":517.79,"location":2,"content":"because the problem is that if we'd never seen an event happen in the training data,"},{"from":517.79,"to":521.49,"location":2,"content":"then our model assigns zero probability to that event."},{"from":521.49,"to":526.41,"location":2,"content":"So, one partial solution to this problem is that maybe we should add a small 
delta,"},{"from":526.41,"to":528.29,"location":2,"content":"small number delta to the count,"},{"from":528.29,"to":530.42,"location":2,"content":"for every word in the vocabulary."},{"from":530.42,"to":533.92,"location":2,"content":"And then this way, every possible word that come next,"},{"from":533.92,"to":536.25,"location":2,"content":"has at least some small probability."},{"from":536.25,"to":539.09,"location":2,"content":"So, petri dishes will have some small probability,"},{"from":539.09,"to":542.41,"location":2,"content":"but then so, will all of the other words which are possibly bad choices."},{"from":542.41,"to":545.58,"location":2,"content":"So, this, uh, technique is called smoothing, because the idea is,"},{"from":545.58,"to":546.95,"location":2,"content":"you're going from a very, uh,"},{"from":546.95,"to":550.05,"location":2,"content":"sparse probability distribution, which is zero, almost everywhere,"},{"from":550.05,"to":551.55,"location":2,"content":"with a few spikes where there's,"},{"from":551.55,"to":553.45,"location":2,"content":"uh, being n-grams that we've seen,"},{"from":553.45,"to":556.1,"location":2,"content":"it goes from that to being a more smooth probability distribution"},{"from":556.1,"to":559.62,"location":2,"content":"where everything has at least a small probability on it."},{"from":559.62,"to":564.27,"location":2,"content":"So, the second sparsity problem which is possibly worse than the first one is,"},{"from":564.27,"to":568.13,"location":2,"content":"what happens if the number in the denominator is zero?"},{"from":568.13,"to":570.2,"location":2,"content":"So, in our example, that would mean,"},{"from":570.2,"to":574.65,"location":2,"content":"what if we never even saw the trigram \"students opened their\" in the training data."},{"from":574.65,"to":578.48,"location":2,"content":"If that happens, then we can't even calculate this probability distribution at"},{"from":578.48,"to":582.82,"location":2,"content":"all for any word W because we never even saw this context before."},{"from":582.82,"to":585.83,"location":2,"content":"So, a possible solution to this is that"},{"from":585.83,"to":588.45,"location":2,"content":"if you can't find \"students open their\" in the corpus,"},{"from":588.45,"to":591.94,"location":2,"content":"then you should back off to just conditioning on the last two words,"},{"from":591.94,"to":593.54,"location":2,"content":"rather than the last three words."},{"from":593.54,"to":595.9,"location":2,"content":"So, now you'd be looking at times when you'd seen,"},{"from":595.9,"to":598.46,"location":2,"content":"uh, \"open their\" and seeing what what's come next."},{"from":598.46,"to":601.35,"location":2,"content":"So, this is called back-off because in this failure case,"},{"from":601.35,"to":604.02,"location":2,"content":"for when you have no data for your 4-gram language model,"},{"from":604.02,"to":606.02,"location":2,"content":"you're backing off to a trigram language model."},{"from":606.02,"to":612.31,"location":2,"content":"Are there any questions at this point?"},{"from":612.31,"to":617.57,"location":2,"content":"Okay. So, um, another thing to note is that these sparsity problems"},{"from":617.57,"to":622.1,"location":2,"content":"get worse if you increase N. 
If you make N larger in your n-gram language model,"},{"from":622.1,"to":623.87,"location":2,"content":"and you might want to do this, for example,"},{"from":623.87,"to":626.39,"location":2,"content":"you might think, uh, I want to have a larger context,"},{"from":626.39,"to":628.57,"location":2,"content":"so I can pay attention to words that"},{"from":628.57,"to":630.89,"location":2,"content":"happened longer ago and that's gonna make it a better predictor."},{"from":630.89,"to":633.27,"location":2,"content":"So, you might think making N bigger is a good idea."},{"from":633.27,"to":636.41,"location":2,"content":"But the problem is if you do that then the sparsity problems get worse."},{"from":636.41,"to":637.7,"location":2,"content":"Because, let's suppose you say,"},{"from":637.7,"to":639.22,"location":2,"content":"I want a 10-gram language model."},{"from":639.22,"to":640.91,"location":2,"content":"Then the problem is that you're going to be counting,"},{"from":640.91,"to":643.48,"location":2,"content":"how often you seen process in 9-grams and 10-grams."},{"from":643.48,"to":645.49,"location":2,"content":"But 9-grams and 10-grams, there's so many of them,"},{"from":645.49,"to":647.62,"location":2,"content":"that the one you are interested in probably never occurred,"},{"from":647.62,"to":651.15,"location":2,"content":"in your training data which means that the whole thing becomes dysfunctional."},{"from":651.15,"to":656.17,"location":2,"content":"So, in practice, we usually can't have N much bigger than five."},{"from":656.17,"to":658.49,"location":2,"content":"Okay. So, that was, uh,"},{"from":658.49,"to":660.88,"location":2,"content":"two sparsity problems with n-gram language models."},{"from":660.88,"to":662.77,"location":2,"content":"Here is a problem with storage."},{"from":662.77,"to":664.71,"location":2,"content":"So, if we look at this equation, uh,"},{"from":664.71,"to":666.78,"location":2,"content":"you have to think about what do you need to"},{"from":666.78,"to":669.37,"location":2,"content":"store in order to use your n-gram language model."},{"from":669.37,"to":672.02,"location":2,"content":"You need to store this count number,"},{"from":672.02,"to":674.09,"location":2,"content":"for all of the n-grams that you observed in"},{"from":674.09,"to":677.22,"location":2,"content":"the corpus when you were going through the training corpus counting them."},{"from":677.22,"to":679.44,"location":2,"content":"And the problem is, that as you increase N,"},{"from":679.44,"to":683.48,"location":2,"content":"then this number of n-grams that you have to store and count increases."},{"from":683.48,"to":687.51,"location":2,"content":"So, another problem with increasing N is that the size of your model,"},{"from":687.51,"to":691.49,"location":2,"content":"or your n-gram model, uh, gets bigger."},{"from":691.49,"to":697.22,"location":2,"content":"Okay, so n-gram Language Models in practice. 
Let's look at an example."},{"from":697.22,"to":702.54,"location":2,"content":"You can actually build a simple trigram Language Model over a 1.7 million word corpus,"},{"from":702.54,"to":704.33,"location":2,"content":"uh, in a few seconds on your laptop."},{"from":704.33,"to":706.14,"location":2,"content":"And in fact, the corpus that I used to do this"},{"from":706.14,"to":707.97,"location":2,"content":"was the same one that you met in assignment one."},{"from":707.97,"to":709.61,"location":2,"content":"It's Reuters' corpus which is,"},{"from":709.61,"to":711.18,"location":2,"content":"uh, business and financial news."},{"from":711.18,"to":712.38,"location":2,"content":"So, if you want to do this yourself,"},{"from":712.38,"to":715,"location":2,"content":"you can follow that link at the bottom of the slide later."},{"from":715,"to":717,"location":2,"content":"So, uh, this is, uh,"},{"from":717,"to":719.28,"location":2,"content":"something which I ran on my laptop in a few second."},{"from":719.28,"to":722.79,"location":2,"content":"So I gave it the context of the bigram today the,"},{"from":722.79,"to":726.48,"location":2,"content":"and then I asked the trigram Language Model what word is likely to come next."},{"from":726.48,"to":729.86,"location":2,"content":"So, the Language Model said that the top next most likely words are"},{"from":729.86,"to":733.46,"location":2,"content":"company, bank, price, Italian, emirate, et cetera."},{"from":733.46,"to":737.64,"location":2,"content":"So already just looking at these probabilities that are assigned to these different words,"},{"from":737.64,"to":739.59,"location":2,"content":"uh, you can see that there is a sparsity problem."},{"from":739.59,"to":741.84,"location":2,"content":"For example, the top two most likely words have"},{"from":741.84,"to":744.72,"location":2,"content":"the exact same probability and the reason for that is,"},{"from":744.72,"to":746.76,"location":2,"content":"that this number is 4 over 26."},{"from":746.76,"to":748.8,"location":2,"content":"So these are quite small integers, uh,"},{"from":748.8,"to":750.27,"location":2,"content":"meaning that we only saw, uh,"},{"from":750.27,"to":753,"location":2,"content":"today the company and today the bank four times each."},{"from":753,"to":754.56,"location":2,"content":"So, uh, this is an example of"},{"from":754.56,"to":757.29,"location":2,"content":"the sparsity problem because overall these are quite low counts,"},{"from":757.29,"to":759.16,"location":2,"content":"we haven't seen that many different, uh,"},{"from":759.16,"to":760.5,"location":2,"content":"versions of this event,"},{"from":760.5,"to":763.88,"location":2,"content":"so we don't have a very granular probability distribution."},{"from":763.88,"to":766.38,"location":2,"content":"But in any case ignoring the sparsity problem,"},{"from":766.38,"to":767.76,"location":2,"content":"I would say that overall,"},{"from":767.76,"to":772.6,"location":2,"content":"these, uh, top suggestions look pretty reasonable."},{"from":772.6,"to":775.67,"location":2,"content":"So you can actually use a Language Model to"},{"from":775.67,"to":778.3,"location":2,"content":"generate text and this is how you would do it."},{"from":778.3,"to":780.74,"location":2,"content":"So let's suppose you have your first two words already, uh,"},{"from":780.74,"to":784.56,"location":2,"content":"you condition on this and you ask your Language Model what's likely to come next."},{"from":784.56,"to":787.3,"location":2,"content":"So then given this 
probability distribution over the words,"},{"from":787.3,"to":788.85,"location":2,"content":"you can sample from it, that is,"},{"from":788.85,"to":791.87,"location":2,"content":"select some word with, you know, its associated probability."},{"from":791.87,"to":794.24,"location":2,"content":"So let's suppose that gives us the word price."},{"from":794.24,"to":797.73,"location":2,"content":"So then price is your next word, and then you just condition on the last two words,"},{"from":797.73,"to":800.38,"location":2,"content":"which in this ex- example is now the price."},{"from":800.38,"to":803.79,"location":2,"content":"So now you get a new probability distribution and you can continue this process,"},{"from":803.79,"to":807.96,"location":2,"content":"uh, sampling and then conditioning again and sampling."},{"from":807.96,"to":810.15,"location":2,"content":"So if you do this long enough,"},{"from":810.15,"to":811.35,"location":2,"content":"you will get a piece of text,"},{"from":811.35,"to":813.69,"location":2,"content":"so this is the actual text that I got when"},{"from":813.69,"to":817,"location":2,"content":"I ran this generation process with this trigram Language Model."},{"from":817,"to":820.26,"location":2,"content":"So it says, \"Today the price of gold per ton,"},{"from":820.26,"to":823.26,"location":2,"content":"while production of shoe lasts and shoe industry,"},{"from":823.26,"to":826.23,"location":2,"content":"the bank intervened just after it considered and rejected"},{"from":826.23,"to":829.37,"location":2,"content":"an IMF demand to rebuild depleted European stocks,"},{"from":829.37,"to":832.81,"location":2,"content":"September 30th end primary 76 counts a share.\""},{"from":832.81,"to":835.25,"location":2,"content":"Okay. So, uh, what do we think about this text?"},{"from":835.25,"to":839.2,"location":2,"content":"Do we think it's good? Are we, uh, surprised?"},{"from":839.2,"to":842.37,"location":2,"content":"Um, I would say that in some ways it is good,"},{"from":842.37,"to":844.62,"location":2,"content":"it's kind of surprisingly grammatical, you know,"},{"from":844.62,"to":847.86,"location":2,"content":"it mostly, uh, kind of parses,"},{"from":847.86,"to":849.15,"location":2,"content":"uh, but you would definitely say that it,"},{"from":849.15,"to":850.5,"location":2,"content":"it doesn't really make any sense."},{"from":850.5,"to":852.18,"location":2,"content":"It's pretty incoherent."},{"from":852.18,"to":854.58,"location":2,"content":"And we shouldn't be surprised that it's incoherent I"},{"from":854.58,"to":857.72,"location":2,"content":"think because if you remember this is a trigram Language Model,"},{"from":857.72,"to":860.26,"location":2,"content":"it has a memory of just the last well,"},{"from":860.26,"to":862.63,"location":2,"content":"three or two words depending on how you look at it."},{"from":862.63,"to":864.51,"location":2,"content":"So clearly we need to consider"},{"from":864.51,"to":867.99,"location":2,"content":"more than three words at a time if we want to model language well."},{"from":867.99,"to":872.26,"location":2,"content":"But as we already know, increasing n makes the sparsity problem worse for"},{"from":872.26,"to":878.37,"location":2,"content":"n-gram Language Models, and it also increases model size. 
Is that a question?"},{"from":878.37,"to":880.32,"location":2,"content":"How does it [inaudible] [NOISE]"},{"from":880.32,"to":883.38,"location":2,"content":"So the question is, how does the n-gram Language Model know when to put commas?"},{"from":883.38,"to":885.15,"location":2,"content":"Uh, so you can,"},{"from":885.15,"to":890.4,"location":2,"content":"[NOISE] decide that commas and other punctuation are just another kind of word,"},{"from":890.4,"to":891.71,"location":2,"content":"as a word or token,"},{"from":891.71,"to":894.51,"location":2,"content":"and then, to the Language Model it doesn't really make much difference."},{"from":894.51,"to":897.71,"location":2,"content":"It just uses that as another possible word that can be, um, predicted,"},{"from":897.71,"to":899.45,"location":2,"content":"that's why we've got the weird spacing around the,"},{"from":899.45,"to":901.77,"location":2,"content":"the commas, because it was essentially viewed as a separate word."},{"from":901.77,"to":906.13,"location":2,"content":"[NOISE] Okay."},{"from":906.13,"to":909.2,"location":2,"content":"So this course is called NLP with Deep Learning."},{"from":909.2,"to":912.76,"location":2,"content":"So you're probably thinking how do we build a neural Language Model?"},{"from":912.76,"to":915.45,"location":2,"content":"So let's just recap, uh, in case you forgot."},{"from":915.45,"to":917.94,"location":2,"content":"Remember that a Language Model is something that takes"},{"from":917.94,"to":920.76,"location":2,"content":"inputs which is a sequence of words X1 up to Xt,"},{"from":920.76,"to":927.47,"location":2,"content":"and then it outputs a probability distribution of what the next word might be Xt plus 1."},{"from":927.47,"to":932.07,"location":2,"content":"Okay, so when we think about what kind of neural models we've met in this course so far,"},{"from":932.07,"to":934.54,"location":2,"content":"uh, we've already met window-based neural models."},{"from":934.54,"to":936.78,"location":2,"content":"And in lecture three, we saw how you could apply"},{"from":936.78,"to":940.03,"location":2,"content":"a window-based neural model to named entity recognition."},{"from":940.03,"to":943.05,"location":2,"content":"So in that scenario you take some kind of window around the word that you"},{"from":943.05,"to":946.13,"location":2,"content":"care about which in this example is Paris, and then, uh,"},{"from":946.13,"to":948.78,"location":2,"content":"you get the word embeddings for those, concatenate them, put them through"},{"from":948.78,"to":952.89,"location":2,"content":"some layers, and then you get your decision which is that Paris is a location not,"},{"from":952.89,"to":955.42,"location":2,"content":"you know, a person or organization."},{"from":955.42,"to":957.9,"location":2,"content":"So that's a recap of what we saw in lecture three."},{"from":957.9,"to":963.79,"location":2,"content":"How would we apply a model like this to language modeling? 
So here's how you would do it."},{"from":963.79,"to":966.93,"location":2,"content":"Here's an example of a fixed-window neural language model."},{"from":966.93,"to":969.42,"location":2,"content":"So, again, we have some kind of context"},{"from":969.42,"to":972.06,"location":2,"content":"which is, as the proctor started the clock the students opened their,"},{"from":972.06,"to":975.23,"location":2,"content":"um, we're trying to guess what word might come next."},{"from":975.23,"to":978.45,"location":2,"content":"So we have to make a similar simplifying assumption to before."},{"from":978.45,"to":981.25,"location":2,"content":"Uh, because it's a fixed size window, uh,"},{"from":981.25,"to":985.5,"location":2,"content":"we have to discard the context except for the window that we're conditioning on."},{"from":985.5,"to":989.07,"location":2,"content":"So let's suppose that our fixed window is of size four."},{"from":989.07,"to":994.39,"location":2,"content":"So what we'll do is similarly to the, ah, NER model."},{"from":994.39,"to":998.4,"location":2,"content":"We're going to represent these words with one-hot vectors,"},{"from":998.4,"to":1002.75,"location":2,"content":"and then we'll use those to look up the word embeddings for these words using the,"},{"from":1002.75,"to":1004.89,"location":2,"content":"uh, embedding lookup matrix."},{"from":1004.89,"to":1008.08,"location":2,"content":"So then we get all of our word embeddings E,1, 2, 3, 4,"},{"from":1008.08,"to":1011.27,"location":2,"content":"and then we concatenate them together to get e. We put this through"},{"from":1011.27,"to":1015.22,"location":2,"content":"a linear layer and a nonlinearity function f to get some kind of hidden layer,"},{"from":1015.22,"to":1017.72,"location":2,"content":"and then we put it through another linear layer and"},{"from":1017.72,"to":1021.86,"location":2,"content":"the softmax function and now we have an output probability distribution y hat."},{"from":1021.86,"to":1025.92,"location":2,"content":"And in our case because we're trying to predict what word comes next, ah, ah,"},{"from":1025.92,"to":1028.43,"location":2,"content":"vector y hat will be of length v where v is"},{"from":1028.43,"to":1030.02,"location":2,"content":"the vocabulary and it will contain"},{"from":1030.02,"to":1032.56,"location":2,"content":"the probabilities of all the different words in the vocabulary."},{"from":1032.56,"to":1035.6,"location":2,"content":"So here I've represented that as a bar charts where if you suppose"},{"from":1035.6,"to":1038.69,"location":2,"content":"you've got all of the words listed alphabetically from a to z,"},{"from":1038.69,"to":1041.3,"location":2,"content":"and then there's the different probabilities of the words."},{"from":1041.3,"to":1042.85,"location":2,"content":"So if everything goes well,"},{"from":1042.85,"to":1044.48,"location":2,"content":"then this language model should tell us that"},{"from":1044.48,"to":1047.93,"location":2,"content":"some likely next words are books and laptops, for example."},{"from":1047.93,"to":1049.94,"location":2,"content":"So none of this should be, um,"},{"from":1049.94,"to":1051.77,"location":2,"content":"unfamiliar to you because you saw it all last week."},{"from":1051.77,"to":1056.47,"location":2,"content":"We're just applying a Window-based model to a different task, such as language modeling."},{"from":1056.47,"to":1058.94,"location":2,"content":"Okay, so what are,"},{"from":1058.94,"to":1062.24,"location":2,"content":"some good things about this model 
compared to n-gram language models?"},{"from":1062.24,"to":1066.31,"location":2,"content":"So one, ah, advantage I'd say is that there's no sparsity problem."},{"from":1066.31,"to":1069.69,"location":2,"content":"If you remember an n-gram language model has a sparsity problem"},{"from":1069.69,"to":1073.2,"location":2,"content":"which is that if you've never seen a particular n-gram in training then"},{"from":1073.2,"to":1075.01,"location":2,"content":"you can't assign any probability to it."},{"from":1075.01,"to":1076.44,"location":2,"content":"You don't have any data on it."},{"from":1076.44,"to":1079.34,"location":2,"content":"Whereas at least here you can take any, you know, for example,"},{"from":1079.34,"to":1082.12,"location":2,"content":"4-gram you want and you can feed it into the, ah,"},{"from":1082.12,"to":1083.8,"location":2,"content":"the neural nets and it will give you"},{"from":1083.8,"to":1086.15,"location":2,"content":"an output distribution of what it thinks the next word would be."},{"from":1086.15,"to":1090.24,"location":2,"content":"It might not be a good prediction but at least it will, it will run."},{"from":1090.24,"to":1092.93,"location":2,"content":"Another advantage is you don't need to store"},{"from":1092.93,"to":1095.09,"location":2,"content":"all of the observed n-grams that you ever saw."},{"from":1095.09,"to":1097.28,"location":2,"content":"So, uh, this is an advantage. By, uh,"},{"from":1097.28,"to":1099.23,"location":2,"content":"comparison, you just have to store"},{"from":1099.23,"to":1102.15,"location":2,"content":"all of the word vectors for all the words in your vocabulary."},{"from":1102.15,"to":1106.09,"location":2,"content":"Uh, but there are quite a lot of problems with this fixed-window language model."},{"from":1106.09,"to":1109.16,"location":2,"content":"So here are some remaining problems: Uh,"},{"from":1109.16,"to":1111.47,"location":2,"content":"one is that your fixed window is probably too small."},{"from":1111.47,"to":1113.88,"location":2,"content":"No matter how big you make your fixed window, uh,"},{"from":1113.88,"to":1115.64,"location":2,"content":"you're probably going to be losing some kind of"},{"from":1115.64,"to":1118.49,"location":2,"content":"useful context that you would want to use sometimes."},{"from":1118.49,"to":1121.74,"location":2,"content":"And in fact, if you try to enlarge the window size,"},{"from":1121.74,"to":1124.17,"location":2,"content":"then you also have to enlarge the size of your,"},{"from":1124.17,"to":1125.48,"location":2,"content":"uh, weight factor, sorry,"},{"from":1125.48,"to":1127.58,"location":2,"content":"your weight matrix W. Uh,"},{"from":1127.58,"to":1129.59,"location":2,"content":"so the width of W because you're multiplying it"},{"from":1129.59,"to":1132.11,"location":2,"content":"by e which is the concatenation of your word embeddings."},{"from":1132.11,"to":1136.39,"location":2,"content":"The width of W grows as you increase the size of your window."},{"from":1136.39,"to":1141.28,"location":2,"content":"So in conclusion, really your window can never be large enough."},{"from":1141.28,"to":1145.46,"location":2,"content":"Another problem with this model which is more of a subtle point is that"},{"from":1145.46,"to":1148.82,"location":2,"content":"X1 and X2 and really all of the words in the window they're,"},{"from":1148.82,"to":1151.1,"location":2,"content":"uh, multiplied by completely different weights in"},{"from":1151.1,"to":1154.57,"location":2,"content":"W. 
So to demonstrate this you could draw a picture."},{"from":1154.57,"to":1157.61,"location":2,"content":"So the problem is that if you have"},{"from":1157.61,"to":1161.72,"location":2,"content":"your weight matrix W and then you have"},{"from":1161.72,"to":1166.91,"location":2,"content":"your concatenation of embeddings e and we have, uh, four embeddings."},{"from":1166.91,"to":1170.39,"location":2,"content":"So we have e_1, e_2, e_3,"},{"from":1170.39,"to":1173.13,"location":2,"content":"e_4, and you multiply, uh,"},{"from":1173.13,"to":1176.62,"location":2,"content":"the concatenated embeddings by the weight matrix."},{"from":1176.62,"to":1179.12,"location":2,"content":"So really you can see that there are essentially"},{"from":1179.12,"to":1182.45,"location":2,"content":"kind of four sections of the weight matrix,"},{"from":1182.45,"to":1185.57,"location":2,"content":"and the first word embedding e_1 is only"},{"from":1185.57,"to":1188.83,"location":2,"content":"ever multiplied by the weights for it in this section,"},{"from":1188.83,"to":1193.03,"location":2,"content":"and that's completely separate to the weights that multiply by e_2 and so forth."},{"from":1193.03,"to":1196.7,"location":2,"content":"So the problem with this is that what you"},{"from":1196.7,"to":1200.06,"location":2,"content":"learn in the weight matrix in one section is not shared with the others."},{"from":1200.06,"to":1203.98,"location":2,"content":"You're kind of learning a lot of similar functions four times."},{"from":1203.98,"to":1207.91,"location":2,"content":"So the reason why we think this is a problem is because there should be a lot of"},{"from":1207.91,"to":1212.13,"location":2,"content":"commonalities in how you process the incoming word embeddings."},{"from":1212.13,"to":1214.88,"location":2,"content":"So what you learn about how to process, you know,"},{"from":1214.88,"to":1218.38,"location":2,"content":"the third embedding, some of it at least should be shared with all of the embeddings."},{"from":1218.38,"to":1221.96,"location":2,"content":"So what I'm saying is it's kind of inefficient that we're learning, uh,"},{"from":1221.96,"to":1224.3,"location":2,"content":"all of these separate weights for these different words"},{"from":1224.3,"to":1229.84,"location":2,"content":"when there's a lot of commonalities between them. Is there a question?"},{"from":1229.84,"to":1231.18,"location":2,"content":"So that's why [inaudible] [NOISE]."},{"from":1231.18,"to":1231.96,"location":2,"content":"Okay-"},{"from":1231.96,"to":1238.28,"location":2,"content":"Yeah, hopefully- hopefully the verbal description is on."},{"from":1238.28,"to":1242.31,"location":2,"content":"So, in conclusion, I'd say that the biggest problem that we've got with"},{"from":1242.31,"to":1245.28,"location":2,"content":"this fixed-size neural model is that clearly we"},{"from":1245.28,"to":1248.36,"location":2,"content":"need some kind of neural architecture that can process any length input,"},{"from":1248.36,"to":1251.07,"location":2,"content":"because most of the problems here come from the fact that we had to make"},{"from":1251.07,"to":1256.67,"location":2,"content":"this simplifying assumption that there was a fixed window."},{"from":1256.67,"to":1260.04,"location":2,"content":"Okay. 
So this motivates, uh,"},{"from":1260.04,"to":1262.59,"location":2,"content":"us to introduce this new family of neural architecture,"},{"from":1262.59,"to":1265.52,"location":2,"content":"it's called recurrent neural networks or RNNs."},{"from":1265.52,"to":1269.1,"location":2,"content":"So, this is a simplified diagram that shows you the most important,"},{"from":1269.1,"to":1271.32,"location":2,"content":"um, features of an RNN."},{"from":1271.32,"to":1275.07,"location":2,"content":"So we have again an input sequence of X1, X2,"},{"from":1275.07,"to":1280.24,"location":2,"content":"et cetera, but you can assume that this sequence is of any arbitrary length you like."},{"from":1280.24,"to":1284.46,"location":2,"content":"The idea is that you have a sequence of hidden states instead of just having,"},{"from":1284.46,"to":1287.17,"location":2,"content":"for example, one hidden state as we did in the previous model."},{"from":1287.17,"to":1290.94,"location":2,"content":"We have a sequence of hidden states and we have as many of them as we have inputs."},{"from":1290.94,"to":1295.44,"location":2,"content":"And the important thing is that each hidden state ht is computed based"},{"from":1295.44,"to":1300.32,"location":2,"content":"on the previous hidden state and also the input on that step."},{"from":1300.32,"to":1304.05,"location":2,"content":"So the reason why they're called hidden states is because you could think of"},{"from":1304.05,"to":1307.42,"location":2,"content":"this as a single state that's mutating over time."},{"from":1307.42,"to":1310.26,"location":2,"content":"It's kind of like several versions of the same thing."},{"from":1310.26,"to":1313.83,"location":2,"content":"And for this reason, we often call these time-steps, right?"},{"from":1313.83,"to":1315.54,"location":2,"content":"So these steps that go left to right,"},{"from":1315.54,"to":1318.95,"location":2,"content":"we often call them time-steps."},{"from":1318.95,"to":1321.87,"location":2,"content":"So the really important thing is that"},{"from":1321.87,"to":1327.21,"location":2,"content":"the same weight matrix W is applied on every time-step of this RNN."},{"from":1327.21,"to":1331.37,"location":2,"content":"That's what makes us able to process any length input we want."},{"from":1331.37,"to":1333.93,"location":2,"content":"Is because we don't have to have different weights on every step,"},{"from":1333.93,"to":1338.87,"location":2,"content":"because we just apply the exact same transformation on every step."},{"from":1338.87,"to":1342.69,"location":2,"content":"So additionally, you can also have some outputs from the RNN."},{"from":1342.69,"to":1343.99,"location":2,"content":"So these y hats,"},{"from":1343.99,"to":1346.15,"location":2,"content":"these are the outputs on each step."},{"from":1346.15,"to":1348.73,"location":2,"content":"And they're optional because you don't have to compute them"},{"from":1348.73,"to":1351.21,"location":2,"content":"or you can compute them on just some steps and not others."},{"from":1351.21,"to":1354.92,"location":2,"content":"It depends on where you want to use your RNN to do."},{"from":1354.92,"to":1358.26,"location":2,"content":"Okay. 
So that's a simple diagram of an RNN."},{"from":1358.26,"to":1359.85,"location":2,"content":"Uh, here I'm going to give you a bit more detail."},{"from":1359.85,"to":1363.63,"location":2,"content":"So here's how you would apply an RNN to do language modeling."},{"from":1363.63,"to":1368.17,"location":2,"content":"So, uh, again, let's suppose that we have some kind of text so far."},{"from":1368.17,"to":1370.86,"location":2,"content":"My text is only four words long,"},{"from":1370.86,"to":1373.32,"location":2,"content":"but you can assume that it could be any length, right?"},{"from":1373.32,"to":1375.42,"location":2,"content":"It's just short because we can't fit more on the slide."},{"from":1375.42,"to":1378.39,"location":2,"content":"So you have some sequence of text, which could be kind of long."},{"from":1378.39,"to":1382.02,"location":2,"content":"And again, we're going to represent these by some kind of one-hot vectors and"},{"from":1382.02,"to":1386.46,"location":2,"content":"use those to look up the word embeddings from our embedding matrix."},{"from":1386.46,"to":1390.37,"location":2,"content":"So then to compute the first hidden state H1,"},{"from":1390.37,"to":1394.3,"location":2,"content":"we need to compute it based on the previous hidden state and the current input."},{"from":1394.3,"to":1396.62,"location":2,"content":"We already have the current input, that's E1,"},{"from":1396.62,"to":1399.57,"location":2,"content":"but the question is where do we get this first hidden state from?"},{"from":1399.57,"to":1401.16,"location":2,"content":"All right, what comes before H1?"},{"from":1401.16,"to":1404.67,"location":2,"content":"So we often call the initial hidden state H0, uh, yes,"},{"from":1404.67,"to":1408.02,"location":2,"content":"we call it the initial hidden state and it can either be something that you learn,"},{"from":1408.02,"to":1412.07,"location":2,"content":"like it's a parameter of the network and you learn how to initialize it,"},{"from":1412.07,"to":1415.39,"location":2,"content":"or you can assume something like maybe it's the zero vector."},{"from":1415.39,"to":1420.49,"location":2,"content":"So the formula we use to compute the new hidden state based on the previous one,"},{"from":1420.49,"to":1423.19,"location":2,"content":"and also the current input is written on the left."},{"from":1423.19,"to":1426.69,"location":2,"content":"So you do a linear transformation on the previous hidden state and on"},{"from":1426.69,"to":1428.64,"location":2,"content":"the current input and then you add some kind of"},{"from":1428.64,"to":1430.92,"location":2,"content":"bias and then put it through a non-linearity,"},{"from":1430.92,"to":1432.99,"location":2,"content":"like for example, the sigmoid function."},{"from":1432.99,"to":1436.67,"location":2,"content":"And that gives you a new hidden state."},{"from":1436.67,"to":1439.47,"location":2,"content":"Okay. So, once you've done that,"},{"from":1439.47,"to":1441.48,"location":2,"content":"then you can compute the next hidden state and you"},{"from":1441.48,"to":1443.85,"location":2,"content":"can keep unrolling the network like this."},{"from":1443.85,"to":1446.03,"location":2,"content":"And that's, uh, yeah,"},{"from":1446.03,"to":1447.45,"location":2,"content":"that's called unrolling because you're kind of"},{"from":1447.45,"to":1450.27,"location":2,"content":"computing each step given the previous one."},{"from":1450.27,"to":1452.16,"location":2,"content":"All right. 
So finally, if you remember,"},{"from":1452.16,"to":1453.33,"location":2,"content":"we're trying to do language modeling."},{"from":1453.33,"to":1457.53,"location":2,"content":"So we're trying to predict which word should come next after the students opened their."},{"from":1457.53,"to":1459.87,"location":2,"content":"So on this fourth step over here,"},{"from":1459.87,"to":1461.2,"location":2,"content":"we can use, uh,"},{"from":1461.2,"to":1462.83,"location":2,"content":"the current hidden state, H4,"},{"from":1462.83,"to":1467.43,"location":2,"content":"and put it through a linear layer and put it through a softmax function and then we get"},{"from":1467.43,"to":1472.8,"location":2,"content":"our output distribution Y-hat 4 which is a distribution over the vocabulary."},{"from":1472.8,"to":1474.72,"location":2,"content":"And again, hopefully, we'll get some kind of"},{"from":1474.72,"to":1478.08,"location":2,"content":"sensible estimates for what the next word might be."},{"from":1478.08,"to":1483.21,"location":2,"content":"Any questions at this point? Yep?"},{"from":1483.21,"to":1487.65,"location":2,"content":"Is the- the number of hidden states gonna be the number of words in your input?"},{"from":1487.65,"to":1490.85,"location":2,"content":"The question is, is the number of hidden states the number of words in your input?"},{"from":1490.85,"to":1493.48,"location":2,"content":"Yeah, in this setting here, uh, yes,"},{"from":1493.48,"to":1498.4,"location":2,"content":"or you could say more generally the number of hidden states is the number of inputs. Yep."},{"from":1498.4,"to":1499.95,"location":2,"content":"And just as with the n-gram model,"},{"from":1499.95,"to":1505.59,"location":2,"content":"could we use the output as the input on the next step, for text generation?"},{"from":1505.59,"to":1507,"location":2,"content":"Yeah, so the question is,"},{"from":1507,"to":1508.65,"location":2,"content":"as with the n-gram language model,"},{"from":1508.65,"to":1510.57,"location":2,"content":"could we use the output as the input on the next step?"},{"from":1510.57,"to":1512.71,"location":2,"content":"And the answer is yes, and I'll show you that in a minute."},{"from":1512.71,"to":1515.7,"location":2,"content":"Any other questions? Yeah."},{"from":1515.7,"to":1517.99,"location":2,"content":"Are you learning the embedding?"},{"from":1517.99,"to":1520.56,"location":2,"content":"The question is, are you learning the embeddings?"},{"from":1520.56,"to":1521.92,"location":2,"content":"Um, that's a choice."},{"from":1521.92,"to":1523.77,"location":2,"content":"You could have the embeddings be for example,"},{"from":1523.77,"to":1527.37,"location":2,"content":"pre-generated embeddings that you download and you use those and they're frozen,"},{"from":1527.37,"to":1528.75,"location":2,"content":"or maybe you could download them,"},{"from":1528.75,"to":1530.19,"location":2,"content":"but then you could fine-tune them."},{"from":1530.19,"to":1532.2,"location":2,"content":"That is, allow them to be changed as parameters of"},{"from":1532.2,"to":1535.17,"location":2,"content":"the network or you could initialize them to,"},{"from":1535.17,"to":1538.56,"location":2,"content":"you know, small, uh, random values and learn them from scratch."},{"from":1538.56,"to":1540.57,"location":2,"content":"Any other questions? 
Yeah."},{"from":1540.57,"to":1543.69,"location":2,"content":"So you said you use the same weight matrix,"},{"from":1543.69,"to":1545.49,"location":2,"content":"like when you do back propagation,"},{"from":1545.49,"to":1548.03,"location":2,"content":"does that mean you only update like WE,"},{"from":1548.03,"to":1551.08,"location":2,"content":"or do you update both WH and WE?"},{"from":1551.08,"to":1556.09,"location":2,"content":"So the question is, you say we reuse the matrix, do we update WE and WH, or just one?"},{"from":1556.09,"to":1558.98,"location":2,"content":"So you certainly learn both WE and WH."},{"from":1558.98,"to":1561.41,"location":2,"content":"I suppose I was emphasizing WH more, but yeah,"},{"from":1561.41,"to":1564.09,"location":2,"content":"they're both matrices that are applied repeatedly."},{"from":1564.09,"to":1565.5,"location":2,"content":"There was also a question about back-prop,"},{"from":1565.5,"to":1567.67,"location":2,"content":"but we're going to cover that later in this lecture."},{"from":1567.67,"to":1572.25,"location":2,"content":"Okay, moving on for now. Um, so,"},{"from":1572.25,"to":1577.53,"location":2,"content":"what are some advantages and disadvantages of this RNN language model?"},{"from":1577.53,"to":1583.01,"location":2,"content":"So here are some advantages that we can see in comparison to the fixed window one."},{"from":1583.01,"to":1588.21,"location":2,"content":"So an obvious advantage is that this RNN can process any length of input."},{"from":1588.21,"to":1591.18,"location":2,"content":"Another advantage is that the computation for"},{"from":1591.18,"to":1595.05,"location":2,"content":"step t can in theory use information from many steps back."},{"from":1595.05,"to":1596.73,"location":2,"content":"So in our motivation example,"},{"from":1596.73,"to":1598.65,"location":2,"content":"which was as the proctor started the clock,"},{"from":1598.65,"to":1599.97,"location":2,"content":"the students opened their."},{"from":1599.97,"to":1602.25,"location":2,"content":"We think that proctor and maybe clock are"},{"from":1602.25,"to":1605.34,"location":2,"content":"both pretty important hints for what might be coming up next."},{"from":1605.34,"to":1607.28,"location":2,"content":"So, at least in theory,"},{"from":1607.28,"to":1609.39,"location":2,"content":"the hidden state at the end"},{"from":1609.39,"to":1615.35,"location":2,"content":"can have access to the information from the input from many steps ago."},{"from":1615.35,"to":1619.79,"location":2,"content":"Another advantage is that the model size doesn't increase for longer inputs."},{"from":1619.79,"to":1622.48,"location":2,"content":"So, uh, the size of the model is actually fixed."},{"from":1622.48,"to":1625.01,"location":2,"content":"It's just WH and WE,"},{"from":1625.01,"to":1629.4,"location":2,"content":"and then also the biases and also the embedding matrix, if you're counting that."},{"from":1629.4,"to":1633,"location":2,"content":"None of those get bigger if you want to apply it to more,"},{"from":1633,"to":1638.03,"location":2,"content":"uh, longer inputs because you just apply the same weights repeatedly."},{"from":1638.03,"to":1643.99,"location":2,"content":"And another advantage is that you have the same weights applied on every time-step."},{"from":1643.99,"to":1649.42,"location":2,"content":"So I said this thing before about how the fixed-sized window neural model,"},{"from":1649.42,"to":1651.72,"location":2,"content":"it was less efficient because it was 
applying"},{"from":1651.72,"to":1654.27,"location":2,"content":"different weights of the weight matrix to the different,"},{"from":1654.27,"to":1655.9,"location":2,"content":"uh, words in the window."},{"from":1655.9,"to":1658.47,"location":2,"content":"And the advantage about this RNN is that it's"},{"from":1658.47,"to":1661.65,"location":2,"content":"applying the exact same transformation to each of the inputs."},{"from":1661.65,"to":1665.84,"location":2,"content":"So this means that if it learns a good way to process one input,"},{"from":1665.84,"to":1668.01,"location":2,"content":"that is applied to every input in the sequence."},{"from":1668.01,"to":1671.48,"location":2,"content":"So you can see it as more efficient in that way."},{"from":1671.48,"to":1674.81,"location":2,"content":"Okay, so what are the disadvantages of this model?"},{"from":1674.81,"to":1678.27,"location":2,"content":"One is that recurrent computation is pretty slow."},{"from":1678.27,"to":1679.99,"location":2,"content":"Uh, as you saw before,"},{"from":1679.99,"to":1683.87,"location":2,"content":"you have to compute the hidden state based on the previous hidden state."},{"from":1683.87,"to":1686.92,"location":2,"content":"So this means that you can't compute all of the hidden states in parallel."},{"from":1686.92,"to":1688.66,"location":2,"content":"You have to compute them in sequence."},{"from":1688.66,"to":1693.12,"location":2,"content":"So, especially if you're trying to compute an RNN over a pretty long sequence of inputs,"},{"from":1693.12,"to":1696.66,"location":2,"content":"this means that the RNN can be pretty slow to compute."},{"from":1696.66,"to":1700.42,"location":2,"content":"Another disadvantage of RNNs is that it tuns out,"},{"from":1700.42,"to":1704.17,"location":2,"content":"in practice, it's quite difficult to access information from many steps back."},{"from":1704.17,"to":1706.29,"location":2,"content":"So even though I said we should be able to remember about"},{"from":1706.29,"to":1708.93,"location":2,"content":"the proctor and the clock and use that to predict exams and our books,"},{"from":1708.93,"to":1710.43,"location":2,"content":"it turns out that RNNs,"},{"from":1710.43,"to":1712.47,"location":2,"content":"at least the ones that I've presented in this lecture,"},{"from":1712.47,"to":1715.31,"location":2,"content":"are not as good as that as you would think."},{"from":1715.31,"to":1719.3,"location":2,"content":"Um, we're gonna learn more about both of these disadvantages later in the course,"},{"from":1719.3,"to":1722.61,"location":2,"content":"and we're going to learn something about how you can try to fix them."},{"from":1722.61,"to":1726.9,"location":2,"content":"Have we gotten any questions at this point? 
Yep."},{"from":1726.9,"to":1728.01,"location":2,"content":"Why do we assume that WH are the same?"},{"from":1728.01,"to":1731.27,"location":2,"content":"Sorry, can you speak up?"},{"from":1731.27,"to":1735.9,"location":2,"content":"Why do we assume that the WH should be the same?"},{"from":1735.9,"to":1739.63,"location":2,"content":"So the question is, why should you assume that the WH are the same?"},{"from":1739.63,"to":1741.45,"location":2,"content":"I suppose, it's not exactly an assumption,"},{"from":1741.45,"to":1744.39,"location":2,"content":"it's more a deliberate decision in the design of an RNN."},{"from":1744.39,"to":1746.46,"location":2,"content":"So, an RNN is by definition,"},{"from":1746.46,"to":1750.45,"location":2,"content":"a network where you apply the exact same weights on every step."},{"from":1750.45,"to":1753.8,"location":2,"content":"So, I suppose the question why do you assume maybe should be,"},{"from":1753.8,"to":1755.22,"location":2,"content":"why is that a good idea?"},{"from":1755.22,"to":1757.52,"location":2,"content":"Um, so I spoke a little bit about why it's a good idea,"},{"from":1757.52,"to":1758.69,"location":2,"content":"and this list of advantages,"},{"from":1758.69,"to":1764.56,"location":2,"content":"I suppose, are the reasons why you'd want to do that. Does that answer your question?"},{"from":1764.56,"to":1769.02,"location":2,"content":"Open their books, right? If you assume that WH are the same,"},{"from":1769.02,"to":1771.42,"location":2,"content":"you mean that like, uh,"},{"from":1771.42,"to":1774.66,"location":2,"content":"Markov chain, it's like a Markov chain."},{"from":1774.66,"to":1777.78,"location":2,"content":"Uh, the trans- transmit, uh,"},{"from":1777.78,"to":1782.95,"location":2,"content":"trans- transfer probability for the human moods open,"},{"from":1782.95,"to":1784.89,"location":2,"content":"they are the same,"},{"from":1784.89,"to":1790.94,"location":2,"content":"but actually the Markov chain."},{"from":1790.94,"to":1796.54,"location":2,"content":"The model, [inaudible] the transfer probability for that is the same,"},{"from":1796.54,"to":1800.89,"location":2,"content":"so [inaudible] probability,"},{"from":1800.89,"to":1807.11,"location":2,"content":"it- it's just an approximation but it's another test."},{"from":1807.11,"to":1808.24,"location":2,"content":"Okay. So I think that [OVERLAPPING]"},{"from":1808.24,"to":1810.81,"location":2,"content":"If you assume WH could be the same,"},{"from":1810.81,"to":1814.72,"location":2,"content":"it's good because you used a number of parameters,"},{"from":1814.72,"to":1820.56,"location":2,"content":"but this is just an, this is just an approximation."},{"from":1820.56,"to":1823.41,"location":2,"content":"The underlying transfer, uh,"},{"from":1823.41,"to":1825.66,"location":2,"content":"probability, it shouldn't be the same. Especially [OVERLAPPING]"},{"from":1825.66,"to":1828.84,"location":2,"content":"Okay. 
Um, so I think the question is saying that given the- these"},{"from":1828.84,"to":1830.54,"location":2,"content":"words the students opened their"},{"from":1830.54,"to":1832.49,"location":2,"content":"are all different and they're happening in different context,"},{"from":1832.49,"to":1835.85,"location":2,"content":"then why should we be applying the same transformation each time?"},{"from":1835.85,"to":1837.44,"location":2,"content":"So that's a- that's a good question."},{"from":1837.44,"to":1841.67,"location":2,"content":"I think, uh, the idea is that you are learning a general function, not just, you know,"},{"from":1841.67,"to":1843.54,"location":2,"content":"how to deal with students,"},{"from":1843.54,"to":1846.09,"location":2,"content":"the one-word students in this one context."},{"from":1846.09,"to":1848.52,"location":2,"content":"We're trying to learn a general function of how you"},{"from":1848.52,"to":1851.07,"location":2,"content":"should deal with a word given the word so far."},{"from":1851.07,"to":1855.09,"location":2,"content":"You're trying to learn a general representation of language and context so far,"},{"from":1855.09,"to":1857.06,"location":2,"content":"which is indeed a very difficult problem."},{"from":1857.06,"to":1860.17,"location":2,"content":"Um, I think you also mentioned that something about an approximation."},{"from":1860.17,"to":1861.78,"location":2,"content":"Uh, another thing to note is that all of"},{"from":1861.78,"to":1864.57,"location":2,"content":"the hidden states are vectors, they're not just single numbers, right?"},{"from":1864.57,"to":1866.67,"location":2,"content":"They are vectors of lengths, I don't know, 500 or something?"},{"from":1866.67,"to":1869.61,"location":2,"content":"So they have quite a large capacity to hold lots of information about"},{"from":1869.61,"to":1873.53,"location":2,"content":"different things in all of their different, um, positions."},{"from":1873.53,"to":1875.63,"location":2,"content":"So, I think the idea is that you can"},{"from":1875.63,"to":1878.26,"location":2,"content":"store a lot of different information in different contexts,"},{"from":1878.26,"to":1879.83,"location":2,"content":"in different parts of the hidden state,"},{"from":1879.83,"to":1881.96,"location":2,"content":"but it is indeed an approximation and there is"},{"from":1881.96,"to":1884.58,"location":2,"content":"some kind of limit to how much information you can store."},{"from":1884.58,"to":1886.85,"location":2,"content":"Okay, any other questions? 
Yes."},{"from":1886.85,"to":1889.41,"location":2,"content":"Since you kinda process any single length frame,"},{"from":1889.41,"to":1891.13,"location":2,"content":"what length do you use during your training?"},{"from":1891.13,"to":1895.04,"location":2,"content":"And does the length you use for training affect WH?"},{"from":1895.04,"to":1899.36,"location":2,"content":"Okay, so, the question is, given that you can have any length input,"},{"from":1899.36,"to":1901.95,"location":2,"content":"what length is the input during training?"},{"from":1901.95,"to":1904.18,"location":2,"content":"So, I suppose in practice,"},{"from":1904.18,"to":1906.51,"location":2,"content":"you choose how long the inputs are in"},{"from":1906.51,"to":1909.63,"location":2,"content":"training either based on what your data is or maybe based on,"},{"from":1909.63,"to":1912.62,"location":2,"content":"uh, your efficiency concerns so maybe you make it artificially"},{"from":1912.62,"to":1915.9,"location":2,"content":"shorter by chopping it up. Um, what was the other question?"},{"from":1915.9,"to":1918.36,"location":2,"content":"Uh, does WH depend on that?"},{"from":1918.36,"to":1921.26,"location":2,"content":"Okay. So the question was, does WH depend on the length you used?"},{"from":1921.26,"to":1924.08,"location":2,"content":"So, no, and that's one of the good things in the advantages list."},{"from":1924.08,"to":1927.16,"location":2,"content":"Is that the model size doesn't increase for longer input,"},{"from":1927.16,"to":1929.04,"location":2,"content":"because we just unroll the RNN"},{"from":1929.04,"to":1931.24,"location":2,"content":"applying the same weights again and again for as long as we'd like."},{"from":1931.24,"to":1933.93,"location":2,"content":"There's no need to have more weights just because you have a longer input."},{"from":1933.93,"to":1936.8,"location":2,"content":"[NOISE] Yeah."},{"from":1936.8,"to":1944.23,"location":2,"content":"So how the ratios that you mentioned are [inaudible] the number of words."},{"from":1944.23,"to":1948.4,"location":2,"content":"[NOISE] Are you asking about capital E or the lowercase E?"},{"from":1948.4,"to":1949.48,"location":2,"content":"Uh, lowercase E."},{"from":1949.48,"to":1950.79,"location":2,"content":"Okay. So, the question is,"},{"from":1950.79,"to":1952.89,"location":2,"content":"how do we choose the dimension of the lowercase Es?"},{"from":1952.89,"to":1954.3,"location":2,"content":"Uh, so, you could, for example,"},{"from":1954.3,"to":1957.12,"location":2,"content":"assume that those are just pre-trained word vectors like the ones that you,"},{"from":1957.12,"to":1958.82,"location":2,"content":"uh, used in assignment one."},{"from":1958.82,"to":1959.71,"location":2,"content":"More like word2vec."},{"from":1959.71,"to":1961.14,"location":2,"content":"Yeah. For example, word2vec,"},{"from":1961.14,"to":1962.61,"location":2,"content":"and you just download them and use them,"},{"from":1962.61,"to":1964.38,"location":2,"content":"or maybe you learn them from scratch, in which case,"},{"from":1964.38,"to":1966.93,"location":2,"content":"you decide at the beginning of training how big you want those vectors to be."},{"from":1966.93,"to":1969.21,"location":2,"content":"[NOISE] Okay. 
I'm gonna move on for now."},{"from":1969.21,"to":1974.89,"location":2,"content":"[NOISE] So, we've learned what an RNN language model is and we've learned how you would,"},{"from":1974.89,"to":1976.85,"location":2,"content":"uh, run one forward, but the question remains,"},{"from":1976.85,"to":1979.08,"location":2,"content":"how would you train an RNN language model?"},{"from":1979.08,"to":1982.23,"location":2,"content":"How would you learn it? [NOISE]"},{"from":1982.23,"to":1983.85,"location":2,"content":"So, as always, in machine learning,"},{"from":1983.85,"to":1986.67,"location":2,"content":"our answer starts with, you're going to get a big corpus of text,"},{"from":1986.67,"to":1991.23,"location":2,"content":"and we're gonna call that just a sequence of words X1 up to X capital T. So,"},{"from":1991.23,"to":1995.12,"location":2,"content":"you feed the sequence of words into the RNN language model, and then,"},{"from":1995.12,"to":1999.62,"location":2,"content":"the idea is that you compute the output distribution Y-hat T for every step T. So,"},{"from":1999.62,"to":2001.7,"location":2,"content":"I know that the picture I showed on the previous, uh,"},{"from":2001.7,"to":2003.56,"location":2,"content":"slide [NOISE] only showed us doing on the last step,"},{"from":2003.56,"to":2006.14,"location":2,"content":"but the idea is, you would actually compute this on every step."},{"from":2006.14,"to":2008.42,"location":2,"content":"So, this means that you're actually predicting"},{"from":2008.42,"to":2011,"location":2,"content":"the probability of the next word on every step."},{"from":2011,"to":2013.13,"location":2,"content":"[NOISE] Okay."},{"from":2013.13,"to":2015.52,"location":2,"content":"So, once you've done that, then you can define the loss function,"},{"from":2015.52,"to":2017.12,"location":2,"content":"and this should be familiar to you by now."},{"from":2017.12,"to":2019.19,"location":2,"content":"Uh, this is the cross-entropy between [NOISE]"},{"from":2019.19,"to":2023.91,"location":2,"content":"our predicted probability distribution Y-hat T and the true, uh,"},{"from":2023.91,"to":2027.26,"location":2,"content":"distribution, which is Y-hat- sorry, just YT,"},{"from":2027.26,"to":2029.57,"location":2,"content":"which is a one-hot vector, uh,"},{"from":2029.57,"to":2031.06,"location":2,"content":"representing the true next [NOISE] words,"},{"from":2031.06,"to":2032.49,"location":2,"content":"which is XT plus one."},{"from":2032.49,"to":2034.49,"location":2,"content":"So, as you've seen before, this, uh,"},{"from":2034.49,"to":2037.1,"location":2,"content":"cross-entropy [NOISE] between those two vectors can be written"},{"from":2037.1,"to":2040.64,"location":2,"content":"also as a negative log probability."},{"from":2040.64,"to":2045.63,"location":2,"content":"And then, lastly, if you average this cross-entropy loss across every step, uh,"},{"from":2045.63,"to":2048.74,"location":2,"content":"every T in the corpus time step T, then,"},{"from":2048.74,"to":2051.8,"location":2,"content":"uh, this gives you your overall loss for the entire training set."},{"from":2051.8,"to":2056.36,"location":2,"content":"[NOISE] Okay."},{"from":2056.36,"to":2058.47,"location":2,"content":"So, just to make that even more clear with a picture,"},{"from":2058.47,"to":2060.08,"location":2,"content":"uh, suppose that our corpus is,"},{"from":2060.08,"to":2061.37,"location":2,"content":"the students open their exams,"},{"from":2061.37,"to":2063.02,"location":2,"content":"et cetera, and it goes on for a 
long time."},{"from":2063.02,"to":2064.55,"location":2,"content":"Then, what we'd be doing is,"},{"from":2064.55,"to":2066.98,"location":2,"content":"we'd be running our RNN over this text, and then,"},{"from":2066.98,"to":2070.53,"location":2,"content":"on every step, we would be predicting the probability [NOISE] distribution Y-hats,"},{"from":2070.53,"to":2071.78,"location":2,"content":"and then, from each of those,"},{"from":2071.78,"to":2073.31,"location":2,"content":"you can calculate what your loss is,"},{"from":2073.31,"to":2076.4,"location":2,"content":"which is the JT, and then, uh, on the first step,"},{"from":2076.4,"to":2078.97,"location":2,"content":"the loss would be the negative log probability of the next word,"},{"from":2078.97,"to":2080.06,"location":2,"content":"which is, in this example,"},{"from":2080.06,"to":2082.04,"location":2,"content":"students, [NOISE] and so on."},{"from":2082.04,"to":2085.07,"location":2,"content":"Each of those is the negative log probability of the next word."},{"from":2085.07,"to":2087.51,"location":2,"content":"[NOISE] And then, once you've computed all of those,"},{"from":2087.51,"to":2089.59,"location":2,"content":"you can add them [NOISE] all up and average them,"},{"from":2089.59,"to":2091.16,"location":2,"content":"and then, this gives you your final loss."},{"from":2091.16,"to":2096.26,"location":2,"content":"[NOISE] Okay. So, there's a caveat here."},{"from":2096.26,"to":2099.93,"location":2,"content":"Um, computing the loss and gradients across the entire corpus,"},{"from":2099.93,"to":2102.35,"location":2,"content":"all of those words X1 up to X capital T is too"},{"from":2102.35,"to":2104.84,"location":2,"content":"expensive [NOISE] because your corpus is probably really big."},{"from":2104.84,"to":2107.81,"location":2,"content":"[NOISE] So, um, as a student asked earlier,"},{"from":2107.81,"to":2110.55,"location":2,"content":"uh, in practice, what do you actually regard as your sequence?"},{"from":2110.55,"to":2112.58,"location":2,"content":"So, in practice, you might regard your sequence as, uh,"},{"from":2112.58,"to":2114.59,"location":2,"content":"something like a sentence or a document,"},{"from":2114.59,"to":2117.43,"location":2,"content":"some shorter unit of text."},{"from":2117.43,"to":2120.89,"location":2,"content":"So, uh, another thing you'll do [NOISE] is, if you remember,"},{"from":2120.89,"to":2123.78,"location":2,"content":"stochastic gradient descent allows you to compute gradients"},{"from":2123.78,"to":2126.98,"location":2,"content":"for small chunks of data rather than the whole corpus at a time."},{"from":2126.98,"to":2129.28,"location":2,"content":"So, in practice, if you're training a language model,"},{"from":2129.28,"to":2132.83,"location":2,"content":"what you're actually likely to be doing is computing the loss for a sentence,"},{"from":2132.83,"to":2135.29,"location":2,"content":"but that's actually a batch of sentences, and then,"},{"from":2135.29,"to":2137.95,"location":2,"content":"you compute the gradients with respect to that batch of sentences,"},{"from":2137.95,"to":2139.76,"location":2,"content":"update your weights, and repeat."},{"from":2139.76,"to":2146.41,"location":2,"content":"Any questions at this point? 
[NOISE] Okay."},{"from":2146.41,"to":2148.04,"location":2,"content":"So, uh, moving onto backprop."},{"from":2148.04,"to":2151.05,"location":2,"content":"Don't worry, there won't be as much backprop as there was last week,"},{"from":2151.05,"to":2153.23,"location":2,"content":"but, uh, there's an interesting question here, right?"},{"from":2153.23,"to":2155.9,"location":2,"content":"So, the, uh, characteristic thing about RNNs"},{"from":2155.9,"to":2158.97,"location":2,"content":"is that they apply the same weight matrix repeatedly."},{"from":2158.97,"to":2160.28,"location":2,"content":"So, the question is,"},{"from":2160.28,"to":2162.22,"location":2,"content":"[NOISE] what's the derivative of our loss function,"},{"from":2162.22,"to":2163.61,"location":2,"content":"let's say, on step T?"},{"from":2163.61,"to":2168.64,"location":2,"content":"What's the derivative of that loss with respect to the repeated weight matrix WH?"},{"from":2168.64,"to":2173.57,"location":2,"content":"So, the answer is that the derivative of the loss, uh,"},{"from":2173.57,"to":2176.39,"location":2,"content":"the gradient with respect to the repeated weight is"},{"from":2176.39,"to":2179.78,"location":2,"content":"the sum of the gradient with respect to each time it appears,"},{"from":2179.78,"to":2181.36,"location":2,"content":"and that's what that equation says."},{"from":2181.36,"to":2185.61,"location":2,"content":"So, on the right, the notation with the vertical line and the I is saying, uh,"},{"from":2185.61,"to":2190.67,"location":2,"content":"the derivative of the loss with respect to WH when it appears on the Ith step."},{"from":2190.67,"to":2192.77,"location":2,"content":"Okay. So, so, why is that true?"},{"from":2192.77,"to":2195.26,"location":2,"content":"[NOISE] Uh, to sketch why this is true,"},{"from":2195.26,"to":2197.84,"location":2,"content":"uh, [NOISE] I'm gonna remind you of the multivariable chain rule."},{"from":2197.84,"to":2202.53,"location":2,"content":"So, uh, this is a screenshot from a Khan Academy article on the multivariable chain rule,"},{"from":2202.53,"to":2204.44,"location":2,"content":"and, uh, I advise you check it out if you"},{"from":2204.44,"to":2206.63,"location":2,"content":"want to learn more because it's very easy to understand."},{"from":2206.63,"to":2208.22,"location":2,"content":"Uh, and what it says is,"},{"from":2208.22,"to":2212.05,"location":2,"content":"given a function F [NOISE] which depends on X and Y,"},{"from":2212.05,"to":2216.14,"location":2,"content":"which are both themselves functions of some variable T, then,"},{"from":2216.14,"to":2219.43,"location":2,"content":"if you want to get the derivative of F with respect to T,"},{"from":2219.43,"to":2224.38,"location":2,"content":"then you need to do the chain ru- rule across X and Y separately and then add them up."},{"from":2224.38,"to":2227.02,"location":2,"content":"[NOISE] So, that's the multivariable chain rule,"},{"from":2227.02,"to":2230.51,"location":2,"content":"[NOISE] and if we apply this to our scenario with trying to take"},{"from":2230.51,"to":2234.89,"location":2,"content":"the derivative of the loss JT with respect to our weight matrix WH,"},{"from":2234.89,"to":2239.3,"location":2,"content":"then you could view it as this kind of diagram [NOISE] where WH has, uh,"},{"from":2239.3,"to":2242.81,"location":2,"content":"a relationship with all of these individual appearances of WH,"},{"from":2242.81,"to":2243.86,"location":2,"content":"but it's a [NOISE] simple 
relationship,"},{"from":2243.86,"to":2245.49,"location":2,"content":"it's just equality, and then,"},{"from":2245.49,"to":2249.69,"location":2,"content":"each of those appearances of WH affect the loss in different ways."},{"from":2249.69,"to":2254.08,"location":2,"content":"So, then, if we apply the multivariable chain rule,"},{"from":2254.08,"to":2257.47,"location":2,"content":"then it says that the derivative of the loss with respect to"},{"from":2257.47,"to":2261.19,"location":2,"content":"WH is the sum of those chain rule things,"},{"from":2261.19,"to":2265.6,"location":2,"content":"but the expression on the right is just one because it's an equality relation,"},{"from":2265.6,"to":2270.48,"location":2,"content":"[NOISE] and then, that gives us the equation that I wrote on the previous slide."},{"from":2270.48,"to":2275.24,"location":2,"content":"So, this is a proof sketch for why the derivative of the loss with"},{"from":2275.24,"to":2280.57,"location":2,"content":"respect to our recurrent matrix is the sum of the derivatives each time it appears."},{"from":2280.57,"to":2283.19,"location":2,"content":"Okay. So, suppose you believe me on that, that is,"},{"from":2283.19,"to":2284.55,"location":2,"content":"how you compute the, uh,"},{"from":2284.55,"to":2286.47,"location":2,"content":"gradient with respect to the recurrent weight."},{"from":2286.47,"to":2288.44,"location":2,"content":"So, a remaining question is, well,"},{"from":2288.44,"to":2290.72,"location":2,"content":"how [NOISE] do we actually calculate this in practice?"},{"from":2290.72,"to":2296.66,"location":2,"content":"[NOISE] So, the answer is that you're going to calculate this sum by doing backprop,"},{"from":2296.66,"to":2299.39,"location":2,"content":"uh, backwards, kind of right to left, um,"},{"from":2299.39,"to":2303.59,"location":2,"content":"through the RNN, and you're going to accumulate this sum as you go."},{"from":2303.59,"to":2304.94,"location":2,"content":"So, the important thing is,"},{"from":2304.94,"to":2308.43,"location":2,"content":"you shouldn't compute each of those things separately, uh,"},{"from":2308.43,"to":2310.88,"location":2,"content":"you should compute them by accumulating, like,"},{"from":2310.88,"to":2314.36,"location":2,"content":"each one can be computed in form- in terms of the previous one."},{"from":2314.36,"to":2319.13,"location":2,"content":"[NOISE] So, this algorithm of computing each of these,"},{"from":2319.13,"to":2321.32,"location":2,"content":"uh, each of these gradients with respect to"},{"from":2321.32,"to":2324.3,"location":2,"content":"the previous one is called backpropagation through time."},{"from":2324.3,"to":2327.65,"location":2,"content":"And, um, I always think that this sounds way more sci-fi than it is."},{"from":2327.65,"to":2329.03,"location":2,"content":"It sounds like it's time travel or something,"},{"from":2329.03,"to":2330.56,"location":2,"content":"but it's actually pretty simple."},{"from":2330.56,"to":2333.29,"location":2,"content":"Uh, it's just the name you give to"},{"from":2333.29,"to":2337.96,"location":2,"content":"applying the backprop algorithm to a recurrent neural network."},{"from":2337.96,"to":2342.35,"location":2,"content":"Any questions at this point? Yep. 
[NOISE]"},{"from":2342.35,"to":2347.24,"location":2,"content":"So, it seems that how you break up the batches matter your end result."},{"from":2347.24,"to":2355.7,"location":2,"content":"[inaudible]."},{"from":2355.7,"to":2361.46,"location":2,"content":"So, if you break it into much more [inaudible]."},{"from":2361.46,"to":2363.61,"location":2,"content":"Okay. So the question is, um, surely,"},{"from":2363.61,"to":2367.86,"location":2,"content":"how you decide to break up your batches affects how you learn, right?"},{"from":2367.86,"to":2369.56,"location":2,"content":"Because if you choose, uh,"},{"from":2369.56,"to":2371.66,"location":2,"content":"one set of data to be your batch, right, then,"},{"from":2371.66,"to":2373.88,"location":2,"content":"you will make your update based on that, and then,"},{"from":2373.88,"to":2376.76,"location":2,"content":"you only update the next one based on [NOISE] where you go from there."},{"from":2376.76,"to":2378.95,"location":2,"content":"So, if you decided to put different data in the batch,"},{"from":2378.95,"to":2380.49,"location":2,"content":"then you would have made a different step."},{"from":2380.49,"to":2382.91,"location":2,"content":"So, that's true, [NOISE] and that is why"},{"from":2382.91,"to":2385.91,"location":2,"content":"stochastic gradient descent is only an approximation of"},{"from":2385.91,"to":2389.66,"location":2,"content":"true gradient descent because the gradient that you compute with"},{"from":2389.66,"to":2393.95,"location":2,"content":"respect to one batch is just an approximation of the true gradient with respect to the,"},{"from":2393.95,"to":2396.09,"location":2,"content":"uh, the loss over the whole corpus."},{"from":2396.09,"to":2398.16,"location":2,"content":"So, yes, it's true that it's an approximation"},{"from":2398.16,"to":2400.58,"location":2,"content":"and how [NOISE] you choose to batch up your data can matter,"},{"from":2400.58,"to":2403.04,"location":2,"content":"and that's why, for example, shuffling your data is a good idea,"},{"from":2403.04,"to":2405.57,"location":2,"content":"and shuffling it differently, each epoch, is a good idea."},{"from":2405.57,"to":2409.13,"location":2,"content":"Uh, but the, the core idea of SGD is [NOISE] that, um,"},{"from":2409.13,"to":2412.09,"location":2,"content":"it should be a good enough approximation that over many steps,"},{"from":2412.09,"to":2414.74,"location":2,"content":"you will, uh, minimize your loss."},{"from":2414.74,"to":2433.01,"location":2,"content":"[NOISE] Any other questions? [NOISE] Yeah."},{"from":2433.01,"to":2435.41,"location":2,"content":"[NOISE] So, is, uh, is the question,"},{"from":2435.41,"to":2437.18,"location":2,"content":"as you compute forward prop,"},{"from":2437.18,"to":2440.34,"location":2,"content":"do you start computing backprop before you've even, like, got to the loss?"},{"from":2440.34,"to":2441.62,"location":2,"content":"Is that the question? [NOISE]"},{"from":2441.62,"to":2442.32,"location":2,"content":"Yes."},{"from":2442.32,"to":2445.64,"location":2,"content":"I didn't think so, right? 
Because you need to know what the loss is in"},{"from":2445.64,"to":2449.03,"location":2,"content":"order to compute the derivative of the loss with respect to something."},{"from":2449.03,"to":2450.56,"location":2,"content":"So, I think you need to get to the end."},{"from":2450.56,"to":2451.76,"location":2,"content":"So, if we assume simplicity,"},{"from":2451.76,"to":2454.49,"location":2,"content":"that there is only one loss which you get at the end of several steps,"},{"from":2454.49,"to":2455.59,"location":2,"content":"then you need to get to the end,"},{"from":2455.59,"to":2459.36,"location":2,"content":"compute the loss before you can compute the derivatives."},{"from":2459.36,"to":2462.2,"location":2,"content":"But I suppose you, you, you could compute the derivative of two,"},{"from":2462.2,"to":2464.24,"location":2,"content":"kind of, adjacent things of one with respect to the other."},{"from":2464.24,"to":2465.47,"location":2,"content":"[OVERLAPPING] But, yeah. [NOISE]"},{"from":2465.47,"to":2467.78,"location":2,"content":"As you're going forward, do- you need to sort of keep a track of what,"},{"from":2467.78,"to":2473.72,"location":2,"content":"what you would have [inaudible] the one you eventually get the loss. [inaudible]"},{"from":2473.72,"to":2475.86,"location":2,"content":"Yes. So, when you forward prop,"},{"from":2475.86,"to":2479.66,"location":2,"content":"you certainly have to hang on to all of the intervening factors."},{"from":2479.66,"to":2480.68,"location":2,"content":"[NOISE] Okay. I'm gonna move on for now."},{"from":2480.68,"to":2484.79,"location":2,"content":"Uh, so, that was a maths-heavy bit but,"},{"from":2484.79,"to":2487.13,"location":2,"content":"um, now, we're getting on to text generation,"},{"from":2487.13,"to":2488.68,"location":2,"content":"which someone asked about earlier."},{"from":2488.68,"to":2492.97,"location":2,"content":"So, um, just as we use the n-gram language model to generate text,"},{"from":2492.97,"to":2496.11,"location":2,"content":"you can also use an RNN language model to generate text,"},{"from":2496.11,"to":2498.65,"location":2,"content":"uh, via the same repeated sampling technique."},{"from":2498.65,"to":2501.05,"location":2,"content":"Um, so, here's a picture of how that would work."},{"from":2501.05,"to":2503.99,"location":2,"content":"How you start off with your initial hidden state H0, uh,"},{"from":2503.99,"to":2506.33,"location":2,"content":"which, uh, we have either as a parameter of"},{"from":2506.33,"to":2509.06,"location":2,"content":"the model or we initialize it to zero, or something like that."},{"from":2509.06,"to":2511.34,"location":2,"content":"So, let's suppose that we have the first word my,"},{"from":2511.34,"to":2514.24,"location":2,"content":"and Iet's suppose I, um, supply that to the model."},{"from":2514.24,"to":2517.24,"location":2,"content":"So, then, using the inputs and the initial hidden state,"},{"from":2517.24,"to":2519.2,"location":2,"content":"you can get our first hidden state H1."},{"from":2519.2,"to":2521.55,"location":2,"content":"And then from there, we can compute the, er,"},{"from":2521.55,"to":2524.76,"location":2,"content":"probability distribution Y hat one of what's coming next,"},{"from":2524.76,"to":2527.43,"location":2,"content":"and then we can use that distribution to sample some word."},{"from":2527.43,"to":2529.39,"location":2,"content":"So let's suppose that we sampled the word favorite."},{"from":2529.39,"to":2534.2,"location":2,"content":"So, the idea is that we use the 
outputted word as the input on the next step."},{"from":2534.2,"to":2536.96,"location":2,"content":"So, we feed favorite into the second step of the RNN,"},{"from":2536.96,"to":2538.22,"location":2,"content":"we get a new hidden state,"},{"from":2538.22,"to":2540.78,"location":2,"content":"and again we get a new probability distribution,"},{"from":2540.78,"to":2542.89,"location":2,"content":"and from that we can sample a new word."},{"from":2542.89,"to":2545.68,"location":2,"content":"So, we can just continue doing this process again and again,"},{"from":2545.68,"to":2547.68,"location":2,"content":"and in this way we can generate some text."},{"from":2547.68,"to":2549.5,"location":2,"content":"So, uh, here we've generated the text,"},{"from":2549.5,"to":2550.76,"location":2,"content":"My favorite season is Spring,"},{"from":2550.76,"to":2556.06,"location":2,"content":"and we can keep going for as long as we'd like."},{"from":2556.06,"to":2559.13,"location":2,"content":"Okay, so, uh, let's have some fun with this."},{"from":2559.13,"to":2561.39,"location":2,"content":"Uh, you can generate,"},{"from":2561.39,"to":2563.89,"location":2,"content":"uh, text using an RNN language model."},{"from":2563.89,"to":2568.07,"location":2,"content":"If you train the RNN language model on any kind of text,"},{"from":2568.07,"to":2571.34,"location":2,"content":"then you can use it to generate text in that style."},{"from":2571.34,"to":2573.38,"location":2,"content":"And in fact, this has become a whole kind of"},{"from":2573.38,"to":2575.78,"location":2,"content":"genre of internet humor that you might've seen."},{"from":2575.78,"to":2577.59,"location":2,"content":"So, uh, for example,"},{"from":2577.59,"to":2580.93,"location":2,"content":"here is an RNN language model trained on Obama speeches,"},{"from":2580.93,"to":2583.1,"location":2,"content":"and I found this in a blog post online."},{"from":2583.1,"to":2587.12,"location":2,"content":"So, here's the text that the RNN language model generated."},{"from":2587.12,"to":2591.35,"location":2,"content":"\"The United States will step up to the cost of a new challenges of"},{"from":2591.35,"to":2595.52,"location":2,"content":"the American people that will share the fact that we created the problem."},{"from":2595.52,"to":2599.63,"location":2,"content":"They were attacked and so that they have to say that"},{"from":2599.63,"to":2604.19,"location":2,"content":"all the task of the final days of war that I will not be able to get this done.\""},{"from":2604.19,"to":2607.13,"location":2,"content":"[LAUGHTER] Okay."},{"from":2607.13,"to":2610.2,"location":2,"content":"So, if we look at this and"},{"from":2610.2,"to":2612.23,"location":2,"content":"especially think about what did"},{"from":2612.23,"to":2614.57,"location":2,"content":"that text look like that we got from the n-gram language model,"},{"from":2614.57,"to":2616.16,"location":2,"content":"the one about the, the price of gold."},{"from":2616.16,"to":2619.72,"location":2,"content":"Um, I'd say that this is kind of recognizably better than that."},{"from":2619.72,"to":2621.62,"location":2,"content":"It seems more fluent overall."},{"from":2621.62,"to":2623.69,"location":2,"content":"Uh, I'd say it has a more of"},{"from":2623.69,"to":2628.53,"location":2,"content":"a sustained context in that it kind of makes sense for longer stretches at a time,"},{"from":2628.53,"to":2631.67,"location":2,"content":"and I'd say it does sound totally like Obama as 
well."},{"from":2631.67,"to":2633.03,"location":2,"content":"So, all of that's pretty good,"},{"from":2633.03,"to":2635.74,"location":2,"content":"but you can see that it's still pretty incoherent overall,"},{"from":2635.74,"to":2638.93,"location":2,"content":"like i- it was quite difficult to read it because it didn't really make sense, right?"},{"from":2638.93,"to":2640.13,"location":2,"content":"So I had to read the words carefully."},{"from":2640.13,"to":2642.89,"location":2,"content":"Um, so, yeah, I think this shows"},{"from":2642.89,"to":2646.31,"location":2,"content":"some of the progress you can get from using RNNs to generate text but still,"},{"from":2646.31,"to":2649.61,"location":2,"content":"um, very far from human level. Here are some more examples."},{"from":2649.61,"to":2653.28,"location":2,"content":"Uh, here's an RNN language model that was trained on the Harry Potter books."},{"from":2653.28,"to":2657.09,"location":2,"content":"And here's what it said. \"Sorry.\" Harry shouted, panicking."},{"from":2657.09,"to":2659.6,"location":2,"content":"\"I'll leave those brooms in London.\" Are they?"},{"from":2659.6,"to":2661.88,"location":2,"content":"\"No idea.\" said Nearly Headless Nick,"},{"from":2661.88,"to":2663.74,"location":2,"content":"casting low close by Cedric,"},{"from":2663.74,"to":2666.98,"location":2,"content":"carrying the last bit of treacle Charms from Harry's shoulder."},{"from":2666.98,"to":2669.29,"location":2,"content":"And to answer him the common room perched upon it,"},{"from":2669.29,"to":2673.03,"location":2,"content":"four arms held a shining knob from when the Spider hadn't felt it seemed."},{"from":2673.03,"to":2674.86,"location":2,"content":"He reached the teams too.\""},{"from":2674.86,"to":2678.07,"location":2,"content":"So, again, I'd say that this is fairly fluent."},{"from":2678.07,"to":2680,"location":2,"content":"It sounds totally like the Harry Potter books."},{"from":2680,"to":2681.71,"location":2,"content":"In fact, I'm pretty impressed by how much it does"},{"from":2681.71,"to":2684.17,"location":2,"content":"sound like in the voice of the Harry Potter books."},{"from":2684.17,"to":2686.51,"location":2,"content":"You even got some character attributes,"},{"from":2686.51,"to":2690.39,"location":2,"content":"I'd say that Harry the character does often panic in the book so that seems right."},{"from":2690.39,"to":2694.52,"location":2,"content":"Um, [LAUGHTER] but some bad things are that we have,"},{"from":2694.52,"to":2698.66,"location":2,"content":"for example, a pretty long run-on sentence in the second paragraph that's hard to read."},{"from":2698.66,"to":2701.49,"location":2,"content":"Uh, you have some nonsensical things that really make no sense."},{"from":2701.49,"to":2703.2,"location":2,"content":"Like, I don't know what a treacle charm is."},{"from":2703.2,"to":2704.89,"location":2,"content":"It sounds delicious but I don't think it's real,"},{"from":2704.89,"to":2707.79,"location":2,"content":"uh, and overall it's just pretty nonsensical."},{"from":2707.79,"to":2712.86,"location":2,"content":"Here's another example. 
Here is an RNN language model that was trained on recipes."},{"from":2712.86,"to":2716,"location":2,"content":"So, uh, [LAUGHTER] this one's pretty bizarre,"},{"from":2716,"to":2718.57,"location":2,"content":"the title is 'chocolate ranch barbecue',"},{"from":2718.57,"to":2720.95,"location":2,"content":"It contains Parmesan cheese,"},{"from":2720.95,"to":2725.55,"location":2,"content":"coconut milk, eggs, and the recipe says place each pasta over layers of lumps,"},{"from":2725.55,"to":2729.5,"location":2,"content":"shape mixture into the moderate oven and simmer until firm."},{"from":2729.5,"to":2731.21,"location":2,"content":"Serve hot in bodied fresh,"},{"from":2731.21,"to":2732.57,"location":2,"content":"mustard orange and cheese."},{"from":2732.57,"to":2735.82,"location":2,"content":"Combine the cheese and salt together the dough in a large skillet;"},{"from":2735.82,"to":2738.14,"location":2,"content":"add the ingredients and stir in the chocolate and pepper."},{"from":2738.14,"to":2741.64,"location":2,"content":"[LAUGHTER] Um, so, one thing that I think is"},{"from":2741.64,"to":2745.34,"location":2,"content":"even more clear here in the recipes example than the prose example,"},{"from":2745.34,"to":2749.41,"location":2,"content":"is the inability to remember what's [NOISE] what's happening overall, right?"},{"from":2749.41,"to":2753.02,"location":2,"content":"Cuz a recipe you could say is pretty challenging because you need to remember"},{"from":2753.02,"to":2757.1,"location":2,"content":"the title of what you're trying to make which in this case is chocolate ranch barbecue,"},{"from":2757.1,"to":2759.47,"location":2,"content":"and you need to actually, you know, make that thing by the end."},{"from":2759.47,"to":2761.06,"location":2,"content":"Uh, you also need to remember what were the ingredients"},{"from":2761.06,"to":2762.5,"location":2,"content":"in the beginning and did you use them."},{"from":2762.5,"to":2765.23,"location":2,"content":"And in a recipe, if you make something and put it in the oven,"},{"from":2765.23,"to":2767.72,"location":2,"content":"you need to take it out later, a- and stuff like that, right?"},{"from":2767.72,"to":2769.4,"location":2,"content":"So, clearly it's not really"},{"from":2769.4,"to":2771.89,"location":2,"content":"remembering what's happening overall or what it's trying to do,"},{"from":2771.89,"to":2773.91,"location":2,"content":"it seems to be just generating kind of"},{"from":2773.91,"to":2777.78,"location":2,"content":"generic recipe sentences and putting them in a random order."},{"from":2777.78,"to":2780.64,"location":2,"content":"Uh, but again, I mean, we can see that it's fairly fluent,"},{"from":2780.64,"to":2783.35,"location":2,"content":"it's grammatically right, it kind of sounds like a recipe."},{"from":2783.35,"to":2785.86,"location":2,"content":"Uh, but the problem is it's just nonsensical."},{"from":2785.86,"to":2788.3,"location":2,"content":"Like for example, shape mixture into"},{"from":2788.3,"to":2791.34,"location":2,"content":"the moderate oven is grammatical but it doesn't make any sense."},{"from":2791.34,"to":2793.3,"location":2,"content":"Okay, last example."},{"from":2793.3,"to":2797.51,"location":2,"content":"So, here's an RNN language model that's trained on paint-color names."},{"from":2797.51,"to":2801.2,"location":2,"content":"And this is an example of a character-level language model because"},{"from":2801.2,"to":2804.84,"location":2,"content":"it's predicting what character comes next not what word comes 
next."},{"from":2804.84,"to":2807.65,"location":2,"content":"And this is why it's able to come up with new words."},{"from":2807.65,"to":2809.84,"location":2,"content":"Another thing to note is that this language model was"},{"from":2809.84,"to":2812.09,"location":2,"content":"trained to be conditioned on some kind of input."},{"from":2812.09,"to":2815.78,"location":2,"content":"So here, the input is the color itself I think represented by the three numbers,"},{"from":2815.78,"to":2817.14,"location":2,"content":"that's probably RGB numbers."},{"from":2817.14,"to":2820.93,"location":2,"content":"And it generated some names for the colors."},{"from":2820.93,"to":2822.14,"location":2,"content":"And I think these are pretty funny."},{"from":2822.14,"to":2824.06,"location":2,"content":"My favorite one is Stanky Bean,"},{"from":2824.06,"to":2825.14,"location":2,"content":"which is in the bottom right."},{"from":2825.14,"to":2827.93,"location":2,"content":"[LAUGHTER] Um, so, it's pretty creative,"},{"from":2827.93,"to":2830.21,"location":2,"content":"[LAUGHTER] and I think these do sound kind of"},{"from":2830.21,"to":2833.36,"location":2,"content":"like paint colors but often they're quite bizarre."},{"from":2833.36,"to":2840.91,"location":2,"content":"[LAUGHTER] Light of Blast is pretty good too."},{"from":2840.91,"to":2843.5,"location":2,"content":"So, uh, you're gonna learn more about"},{"from":2843.5,"to":2845.76,"location":2,"content":"character-level language models in a future lecture,"},{"from":2845.76,"to":2848.87,"location":2,"content":"and you're also going to learn more about how to condition a language model"},{"from":2848.87,"to":2852.44,"location":2,"content":"based on some kind of input such as the color, um, code."},{"from":2852.44,"to":2854.33,"location":2,"content":"So, these are pretty funny,"},{"from":2854.33,"to":2855.89,"location":2,"content":"uh, but I do want to say a warning."},{"from":2855.89,"to":2858.92,"location":2,"content":"Um, you'll find a lot of these kinds of articles online,"},{"from":2858.92,"to":2860.59,"location":2,"content":"uh, often with headlines like,"},{"from":2860.59,"to":2863,"location":2,"content":"\"We forced a bot to watch, you know,"},{"from":2863,"to":2866.7,"location":2,"content":"1000 hours of sci-fi movies and it wrote a script,\" something like that."},{"from":2866.7,"to":2870.8,"location":2,"content":"Um, so, my advice is you have to take these with a big pinch of salt, because often,"},{"from":2870.8,"to":2873.08,"location":2,"content":"uh, the examples that people put online were"},{"from":2873.08,"to":2875.38,"location":2,"content":"hand selected by humans to be the funniest examples."},{"from":2875.38,"to":2878.66,"location":2,"content":"Like I think all of the examples I've shown today were definitely hand selected"},{"from":2878.66,"to":2882.2,"location":2,"content":"by humans as the funniest examples that the RNN came up with."},{"from":2882.2,"to":2885.45,"location":2,"content":"And in some cases they might even have been edited by a human."},{"from":2885.45,"to":2888.56,"location":2,"content":"So, uh, yeah, you do need to be a little bit skeptical when you look at these examples."},{"from":2888.56,"to":2890.2,"location":2,"content":"[OVERLAPPING] Yep."},{"from":2890.2,"to":2892.93,"location":2,"content":"So, uh, in the Harry Potter one,"},{"from":2892.93,"to":2896.63,"location":2,"content":"there was a opening quote and then there was a closing quote."},{"from":2896.63,"to":2898.74,"location":2,"content":"So, like do you expect 
the RNN,"},{"from":2898.74,"to":2902,"location":2,"content":"like when it puts that opening quote and keeps putting more words,"},{"from":2902,"to":2908.82,"location":2,"content":"do you expect the probability of a closing quote to like increase as you're going or decrease?"},{"from":2908.82,"to":2911.15,"location":2,"content":"That's a great question. So, uh,"},{"from":2911.15,"to":2912.51,"location":2,"content":"the question was, uh,"},{"from":2912.51,"to":2914.45,"location":2,"content":"we noticed that in the Harry Potter example,"},{"from":2914.45,"to":2916.3,"location":2,"content":"there was some open quotes and some closed quotes."},{"from":2916.3,"to":2918.41,"location":2,"content":"And it looks like the model didn't screw up, right?"},{"from":2918.41,"to":2920.07,"location":2,"content":"All of these open quotes and closed quotes,"},{"from":2920.07,"to":2921.82,"location":2,"content":"uh, are in the correct places."},{"from":2921.82,"to":2924.45,"location":2,"content":"So, the question is, do we expect the model to put"},{"from":2924.45,"to":2928.78,"location":2,"content":"a higher probability on closing the quote given that is inside a quo- quote passage?"},{"from":2928.78,"to":2931.11,"location":2,"content":"So, I should say definitely yes and"},{"from":2931.11,"to":2934.22,"location":2,"content":"that's most- mostly the explanation for why this works."},{"from":2934.22,"to":2936.5,"location":2,"content":"Um, there's been some really interesting work in trying"},{"from":2936.5,"to":2938.54,"location":2,"content":"to look inside the hidden states of, uh,"},{"from":2938.54,"to":2941.34,"location":2,"content":"language models to see whether it's tracking things like,"},{"from":2941.34,"to":2943.61,"location":2,"content":"are we inside an open quote or a close quote?"},{"from":2943.61,"to":2946.43,"location":2,"content":"And there has been some limited evidence to show that"},{"from":2946.43,"to":2949.37,"location":2,"content":"maybe there are certain neuron or neurons inside the hidden state,"},{"from":2949.37,"to":2950.9,"location":2,"content":"which are tracking things like,"},{"from":2950.9,"to":2952.55,"location":2,"content":"are we currently inside a quote or not?"},{"from":2952.55,"to":2953.86,"location":2,"content":"[NOISE]. Yeah."},{"from":2953.86,"to":2958.37,"location":2,"content":"So, so, like do you think the probability would increase as you go more to the right [OVERLAPPING]?"},{"from":2958.37,"to":2962.27,"location":2,"content":"So, the question is as the quote passage goes on for longer,"},{"from":2962.27,"to":2963.74,"location":2,"content":"do you think the priority or"},{"from":2963.74,"to":2966.77,"location":2,"content":"the probability of outputting a closed quote should increase?"},{"from":2966.77,"to":2968.05,"location":2,"content":"Um, I don't know."},{"from":2968.05,"to":2971.42,"location":2,"content":"Maybe. Um, that would be good, I suppose,"},{"from":2971.42,"to":2972.98,"location":2,"content":"because you don't want an infinite quote,"},{"from":2972.98,"to":2975.65,"location":2,"content":"uh, but I wouldn't be surprised if that didn't happen."},{"from":2975.65,"to":2979.4,"location":2,"content":"Like I wouldn't be surprised if maybe some other worse-trained language models,"},{"from":2979.4,"to":2981.39,"location":2,"content":"just opened quotes and never closed them."},{"from":2981.39,"to":2984.82,"location":2,"content":"Uh, any other questions? 
Yeah."},{"from":2984.82,"to":2987.61,"location":2,"content":"What are the dimensions of the W metric?"},{"from":2987.61,"to":2990.71,"location":2,"content":"Okay. So, the question is what are the dimensions of the W metric?"},{"from":2990.71,"to":2992.48,"location":2,"content":"So we're going back to the online stuff."},{"from":2992.48,"to":2995.9,"location":2,"content":"Uh, okay. You're asking me about W_h or W_e or something else?"},{"from":2995.9,"to":2996.61,"location":2,"content":"Yeah."},{"from":2996.61,"to":2998.96,"location":2,"content":"So, W_h will be,"},{"from":2998.96,"to":3001.43,"location":2,"content":"uh, if we say that the hidden size has size n,"},{"from":3001.43,"to":3007.24,"location":2,"content":"then W_h will be n by n. And if we suppose that the embeddings have size d,"},{"from":3007.24,"to":3008.64,"location":2,"content":"then W_e will be, uh,"},{"from":3008.64,"to":3012.55,"location":2,"content":"d by n, n by d, maybe."},{"from":3012.55,"to":3019.99,"location":2,"content":"Does that answer your question? [NOISE] Uh,"},{"from":3019.99,"to":3023.38,"location":2,"content":"any other questions about generating or anything? Yep."},{"from":3023.38,"to":3028.03,"location":2,"content":"So, you said that there was a long sentence in the Harry Potter-related text?"},{"from":3028.03,"to":3028.43,"location":2,"content":"Yeah."},{"from":3028.43,"to":3033.64,"location":2,"content":"Is it ever sort of practical to combine RNNs with like in this hand written rules?"},{"from":3033.64,"to":3035.39,"location":2,"content":"Sorry. Is it ever practical to combine-"},{"from":3035.39,"to":3037.81,"location":2,"content":"RNNs with a written list of hand-written rules."},{"from":3037.81,"to":3038.83,"location":2,"content":"[OVERLAPPING]"},{"from":3038.83,"to":3039.88,"location":2,"content":"Okay. Yeah. That's a great question."},{"from":3039.88,"to":3042.22,"location":2,"content":"So the question was, is it ever practical to"},{"from":3042.22,"to":3044.98,"location":2,"content":"combine RNNs with a list of hand-written rules?"},{"from":3044.98,"to":3049.28,"location":2,"content":"For example, don't let your sentence be longer than this many words."},{"from":3049.28,"to":3050.53,"location":2,"content":"Um, so yeah."},{"from":3050.53,"to":3054.07,"location":2,"content":"I'd say it probably is practical maybe especially if you're interested in, uh,"},{"from":3054.07,"to":3056.26,"location":2,"content":"making sure that certain bad things don't happen,"},{"from":3056.26,"to":3061.9,"location":2,"content":"you might apply some hacky rules like yeah forcing it to end, uh, early."},{"from":3061.9,"to":3063.58,"location":2,"content":"I mean, okay. 
So there's this thing called Beam Search"},{"from":3063.58,"to":3065.34,"location":2,"content":"which we're going to learn about in a later lecture,"},{"from":3065.34,"to":3066.64,"location":2,"content":"which essentially doesn't just,"},{"from":3066.64,"to":3069.34,"location":2,"content":"um, choose one word in each step and continue."},{"from":3069.34,"to":3072.32,"location":2,"content":"It explores many different options for words you could generate."},{"from":3072.32,"to":3074.41,"location":2,"content":"And you can apply some kinds of rules on that"},{"from":3074.41,"to":3076.54,"location":2,"content":"where if you have lots of different things to choose from,"},{"from":3076.54,"to":3078.25,"location":2,"content":"then you can maybe get rid of"},{"from":3078.25,"to":3081.26,"location":2,"content":"some options if you don't like them because they break some of your rules."},{"from":3081.26,"to":3089.49,"location":2,"content":"But, um, it can be difficult to do. Any other questions?"},{"from":3089.49,"to":3098.38,"location":2,"content":"Okay. Um, so we've talked about generating from language models."},{"from":3098.38,"to":3100.63,"location":2,"content":"Uh, so unfortunately, you can't just use"},{"from":3100.63,"to":3104.14,"location":2,"content":"generation as your evaluation metric for the language models."},{"from":3104.14,"to":3107.24,"location":2,"content":"You do need some kind of, um, measurable metric."},{"from":3107.24,"to":3112.01,"location":2,"content":"So, the standard evaluation metric for language models is called perplexity."},{"from":3112.01,"to":3114.25,"location":2,"content":"And, uh, perplexity is defined as"},{"from":3114.25,"to":3118.48,"location":2,"content":"the inverse probability of the corpus according to the language model."},{"from":3118.48,"to":3122.2,"location":2,"content":"So, if you look at it you can see that that's what this formula is saying."},{"from":3122.2,"to":3124.07,"location":2,"content":"It's saying that for every, uh,"},{"from":3124.07,"to":3127.55,"location":2,"content":"word xt, lowercase t, in the corpus, uh,"},{"from":3127.55,"to":3130.42,"location":2,"content":"we're computing the probability of that word given"},{"from":3130.42,"to":3133.63,"location":2,"content":"everything that came so far but its inverse is one over that."},{"from":3133.63,"to":3136.6,"location":2,"content":"And then lastly, when normalizing this big,"},{"from":3136.6,"to":3139.96,"location":2,"content":"uh, product by the number of words,"},{"from":3139.96,"to":3143.99,"location":2,"content":"which is capital T. 
And the reason why we're doing that is because if we didn't do that,"},{"from":3143.99,"to":3148.2,"location":2,"content":"then perplexity would just get smaller and smaller as your corpus got bigger."},{"from":3148.2,"to":3151.14,"location":2,"content":"So we need to normalize by that factor."},{"from":3151.14,"to":3153.91,"location":2,"content":"So, you can actually show that this, uh,"},{"from":3153.91,"to":3158.47,"location":2,"content":"perplexity is equal to the exponential of the cross-entropy loss J Theta."},{"from":3158.47,"to":3161.47,"location":2,"content":"So if you remember, cross-entropy loss J Theta is, uh,"},{"from":3161.47,"to":3164.3,"location":2,"content":"the training objective that we're using to train the language model."},{"from":3164.3,"to":3166.55,"location":2,"content":"And, uh, by rearranging things a little bit,"},{"from":3166.55,"to":3170.89,"location":2,"content":"you can see that perplexity is actually the exponential of the cross-entropy."},{"from":3170.89,"to":3172.75,"location":2,"content":"And this is a good thing, uh,"},{"from":3172.75,"to":3175.75,"location":2,"content":"because if we're training the language model to, uh,"},{"from":3175.75,"to":3178.9,"location":2,"content":"minimize the cross-entropy loss,"},{"from":3178.9,"to":3184.8,"location":2,"content":"then you are training it to optimize the perplexity as well."},{"from":3184.8,"to":3188.86,"location":2,"content":"So you should remember that lower perplexity is better,"},{"from":3188.86,"to":3192.64,"location":2,"content":"uh, because perplexity is the inverse probability of the corpus."},{"from":3192.64,"to":3197.97,"location":2,"content":"So, uh, if you want your language model to assign high probability to the corpus, right?"},{"from":3197.97,"to":3201.6,"location":2,"content":"Then that means you want to get low perplexity."},{"from":3201.6,"to":3208.48,"location":2,"content":"Uh, any questions? [NOISE] Okay."},{"from":3208.48,"to":3216.22,"location":2,"content":"Uh, so RNNs have been pretty successful in recent years in improving perplexity."},{"from":3216.22,"to":3219.88,"location":2,"content":"So, uh, this is a results table from a recent,"},{"from":3219.88,"to":3223.63,"location":2,"content":"uh, Facebook research paper about RNN language models."},{"from":3223.63,"to":3226.6,"location":2,"content":"And, uh, you don't have to understand all of the details of this table,"},{"from":3226.6,"to":3228.05,"location":2,"content":"but what it's telling you is that,"},{"from":3228.05,"to":3230.78,"location":2,"content":"on the, uh, top where we have n-gram language model."},{"from":3230.78,"to":3232.24,"location":2,"content":"And then in the subsequent rows,"},{"from":3232.24,"to":3235.74,"location":2,"content":"we have some increasingly complex and large RNNs."},{"from":3235.74,"to":3238.95,"location":2,"content":"And you can see that the perplexity numbers are decreasing,"},{"from":3238.95,"to":3240.47,"location":2,"content":"because lower is better."},{"from":3240.47,"to":3242.77,"location":2,"content":"So RNNs have been really great for"},{"from":3242.77,"to":3248.91,"location":2,"content":"making more effective language models in the last few years."},{"from":3248.91,"to":3251.7,"location":2,"content":"Okay. 
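[Editor's sketch] Reconstructed from the description above, the formula is roughly the following: perplexity is the inverse probability that the language model assigns to the corpus, normalized by the number of words T, and rearranging shows it equals the exponential of the average cross-entropy loss, so lower is better in both cases.

```latex
\text{perplexity}
  = \prod_{t=1}^{T}
      \left( \frac{1}{P_{\text{LM}}\!\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)} \right)^{1/T}
  = \exp\!\left( \frac{1}{T} \sum_{t=1}^{T}
      -\log P_{\text{LM}}\!\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right) \right)
  = \exp\big( J(\theta) \big)
```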
So to zoom out a little bit,"},{"from":3251.7,"to":3253.12,"location":2,"content":"you might be thinking, uh,"},{"from":3253.12,"to":3255.46,"location":2,"content":"why should I care about Language Modelling?"},{"from":3255.46,"to":3257.35,"location":2,"content":"Why is it important? I'd say there are"},{"from":3257.35,"to":3259.74,"location":2,"content":"two main reasons why Language Modelling is important."},{"from":3259.74,"to":3261.16,"location":2,"content":"Uh, so the first one is,"},{"from":3261.16,"to":3263.62,"location":2,"content":"that language modelling is a benchmark task that"},{"from":3263.62,"to":3266.77,"location":2,"content":"helps us measure our progress on understanding language."},{"from":3266.77,"to":3268.54,"location":2,"content":"So, you could view language modeling as"},{"from":3268.54,"to":3271.99,"location":2,"content":"a pretty general language understanding task, right?"},{"from":3271.99,"to":3275.43,"location":2,"content":"Because predicting what word comes next, given any,"},{"from":3275.43,"to":3277.8,"location":2,"content":"any kind of, uh, generic text."},{"from":3277.8,"to":3280.97,"location":2,"content":"Um, that's quite a difficult and general problem."},{"from":3280.97,"to":3283.33,"location":2,"content":"And in order to be good at language modelling,"},{"from":3283.33,"to":3285.34,"location":2,"content":"you have to understand a lot of things, right?"},{"from":3285.34,"to":3286.78,"location":2,"content":"You have to understand grammar,"},{"from":3286.78,"to":3288.11,"location":2,"content":"you have to understand syntax,"},{"from":3288.11,"to":3289.61,"location":2,"content":"and you have to understand,"},{"from":3289.61,"to":3291.11,"location":2,"content":"uh, logic and reasoning."},{"from":3291.11,"to":3292.57,"location":2,"content":"And you have to understand something about,"},{"from":3292.57,"to":3293.84,"location":2,"content":"you know, real-world knowledge."},{"from":3293.84,"to":3295.72,"location":2,"content":"You have to understand a lot of things in order to be"},{"from":3295.72,"to":3297.97,"location":2,"content":"able to do language modelling properly."},{"from":3297.97,"to":3299.53,"location":2,"content":"So, the reason why we care about it as"},{"from":3299.53,"to":3302.35,"location":2,"content":"a benchmark task is because if you're able to build a model,"},{"from":3302.35,"to":3305.05,"location":2,"content":"which is a better language model than the ones that came before it,"},{"from":3305.05,"to":3307.93,"location":2,"content":"then you must have made some kind of progress on at"},{"from":3307.93,"to":3311.62,"location":2,"content":"least some of those sub-components of natural language understanding."},{"from":3311.62,"to":3314.47,"location":2,"content":"So, another more tangible reason why you might"},{"from":3314.47,"to":3316.93,"location":2,"content":"care about language modelling is that it's a sub-component of"},{"from":3316.93,"to":3319.99,"location":2,"content":"many, many NLP tasks, especially those which involve"},{"from":3319.99,"to":3323.56,"location":2,"content":"generating text or estimating the probability of text."},{"from":3323.56,"to":3325.68,"location":2,"content":"So, here's a bunch of examples."},{"from":3325.68,"to":3327.22,"location":2,"content":"Uh, one is predictive typing."},{"from":3327.22,"to":3329.17,"location":2,"content":"That's the example that we showed at the beginning of the lecture"},{"from":3329.17,"to":3331.45,"location":2,"content":"with typing on your phone or searching on 
Google."},{"from":3331.45,"to":3335.18,"location":2,"content":"Uh, this is also very useful for people who have movement disabilities, uh,"},{"from":3335.18,"to":3339.59,"location":2,"content":"because there are these systems that help people communicate using fewer movements."},{"from":3339.59,"to":3341.92,"location":2,"content":"Uh, another example is speech recognition."},{"from":3341.92,"to":3343.6,"location":2,"content":"So, in speech recognition you have"},{"from":3343.6,"to":3345.82,"location":2,"content":"some kind of audio recording of a person saying something"},{"from":3345.82,"to":3349.97,"location":2,"content":"and often it's kind of noisy and hard to make out what they're saying and you need to,"},{"from":3349.97,"to":3351.7,"location":2,"content":"uh, figure out what words did they say."},{"from":3351.7,"to":3355.3,"location":2,"content":"So this is an example where you have to estimate the probability of different,"},{"from":3355.3,"to":3358.21,"location":2,"content":"uh, different options of what, what it is they could have said."},{"from":3358.21,"to":3360.45,"location":2,"content":"And in the same way, handwriting recognition,"},{"from":3360.45,"to":3362.41,"location":2,"content":"is an example where there's a lot of noise"},{"from":3362.41,"to":3365.47,"location":2,"content":"and you have to figure out what the person intended to say."},{"from":3365.47,"to":3367.81,"location":2,"content":"Uh, spelling and grammar correction is yet"},{"from":3367.81,"to":3370.7,"location":2,"content":"another example where it's all about trying to figure out what someone meant."},{"from":3370.7,"to":3372.34,"location":2,"content":"And that means you have to actually understand how"},{"from":3372.34,"to":3374.7,"location":2,"content":"likely it is that they were saying different things."},{"from":3374.7,"to":3379.55,"location":2,"content":"Uh, an interesting, an interesting application is authorship identification."},{"from":3379.55,"to":3382.48,"location":2,"content":"So suppose that you have a piece of text and you're trying to"},{"from":3382.48,"to":3385.49,"location":2,"content":"figure out who likely wrote it and maybe you have,"},{"from":3385.49,"to":3389.83,"location":2,"content":"uh, several different authors and you have text written by those different authors."},{"from":3389.83,"to":3391.28,"location":2,"content":"So you could, for example,"},{"from":3391.28,"to":3394.72,"location":2,"content":"train a separate language model on each of the different authors' texts."},{"from":3394.72,"to":3396.16,"location":2,"content":"And then, because, remember,"},{"from":3396.16,"to":3399.8,"location":2,"content":"a language model can tell you the probability of a given piece of text."},{"from":3399.8,"to":3402.43,"location":2,"content":"Then you could ask all the different language models,"},{"from":3402.43,"to":3405.79,"location":2,"content":"um, how likely the text in question is,"},{"from":3405.79,"to":3409.72,"location":2,"content":"and then if a certain author's language model says that it's likely then that"},{"from":3409.72,"to":3415,"location":2,"content":"means the text in question is more likely to be written by that author."},{"from":3415,"to":3417.82,"location":2,"content":"Um, other examples include machine translation."},{"from":3417.82,"to":3419.2,"location":2,"content":"This is a huge, uh,"},{"from":3419.2,"to":3421.39,"location":2,"content":"application of language models,"},{"from":3421.39,"to":3423.57,"location":2,"content":"uh, because it's all about generating 
text."},{"from":3423.57,"to":3425.74,"location":2,"content":"Uh, similarly, summarization is"},{"from":3425.74,"to":3429.28,"location":2,"content":"a task where we need to generate some text given some input text."},{"from":3429.28,"to":3431.18,"location":2,"content":"Uh, dialogue as well,"},{"from":3431.18,"to":3434.98,"location":2,"content":"not all dialogue agents necessarily are RNN language models but you can"},{"from":3434.98,"to":3439.28,"location":2,"content":"build a dialogue agent that generates the text using an RNN language model."},{"from":3439.28,"to":3441.56,"location":2,"content":"And there are more examples as well."},{"from":3441.56,"to":3445.36,"location":2,"content":"Any questions on this? [LAUGHTER] Yep."},{"from":3445.36,"to":3467.88,"location":2,"content":"So, I know that [inaudible]"},{"from":3467.88,"to":3469.95,"location":2,"content":"Great question. So, the question was,"},{"from":3469.95,"to":3471.47,"location":2,"content":"uh, for some of these examples, uh,"},{"from":3471.47,"to":3475.32,"location":2,"content":"such as speech recognition or maybe [NOISE] image captioning,"},{"from":3475.32,"to":3479.29,"location":2,"content":"the input is audio or image or something that is not text, right?"},{"from":3479.29,"to":3481.78,"location":2,"content":"So, you can't represent it in the way that we've talked about so far."},{"from":3481.78,"to":3484.18,"location":2,"content":"Um, so, [NOISE] in those examples,"},{"from":3484.18,"to":3486.46,"location":2,"content":"you will have some way of representing the input,"},{"from":3486.46,"to":3488.72,"location":2,"content":"some way of encoding the audio or the image or whatever."},{"from":3488.72,"to":3493.32,"location":2,"content":"Uh, the reason I brought it up now in terms of language models is that that's the input,"},{"from":3493.32,"to":3495.68,"location":2,"content":"but you use the language model to get the output, right?"},{"from":3495.68,"to":3497.17,"location":2,"content":"So, the language model, [NOISE] uh, generates"},{"from":3497.17,"to":3499.34,"location":2,"content":"the output in the way that we saw earlier, uh,"},{"from":3499.34,"to":3502.12,"location":2,"content":"but we're gonna learn more about those conditional language [NOISE] models later."},{"from":3502.12,"to":3505.09,"location":2,"content":"[NOISE] Anyone else?"},{"from":3505.09,"to":3509.02,"location":2,"content":"[NOISE] Okay."},{"from":3509.02,"to":3512.97,"location":2,"content":"[NOISE] So, uh, here's a recap."},{"from":3512.97,"to":3516.73,"location":2,"content":"If I've lost you somewhere in this lecture, uh, or you got tired,"},{"from":3516.73,"to":3518.77,"location":2,"content":"um, now's a great time to jump back in"},{"from":3518.77,"to":3521.05,"location":2,"content":"because things are gonna get a little bit more accessible."},{"from":3521.05,"to":3523.05,"location":2,"content":"Okay. 
So, here's a recap of what we've done today."},{"from":3523.05,"to":3526.21,"location":2,"content":"Uh, a language model is a system that predicts the next word,"},{"from":3526.21,"to":3528.46,"location":2,"content":"[NOISE] and a recurrent neural network,"},{"from":3528.46,"to":3530.59,"location":2,"content":"is a new family, oh, new to us,"},{"from":3530.59,"to":3533.71,"location":2,"content":"a family of neural networks that takes sequential input"},{"from":3533.71,"to":3537.18,"location":2,"content":"of any length and it applies the same weights on every step,"},{"from":3537.18,"to":3539.62,"location":2,"content":"and it can optionally produce some kind of output on"},{"from":3539.62,"to":3542.02,"location":2,"content":"each step or some of the steps or none of the steps."},{"from":3542.02,"to":3544.95,"location":2,"content":"[NOISE] So, don't be confused."},{"from":3544.95,"to":3548.3,"location":2,"content":"A recurrent neural network is not [NOISE] the same thing as a language model."},{"from":3548.3,"to":3552.97,"location":2,"content":"Uh, we've seen today that an RNN is a great way to build a language model, but actually,"},{"from":3552.97,"to":3555.01,"location":2,"content":"it turns out that you can use RNNs for,"},{"from":3555.01,"to":3557.71,"location":2,"content":"uh, a lot of other different things that are not language modeling."},{"from":3557.71,"to":3559.84,"location":2,"content":"[NOISE] So, here's a few examples of that."},{"from":3559.84,"to":3564.09,"location":2,"content":"[NOISE] Uh, you can use an RNN to do a tagging task."},{"from":3564.09,"to":3566.32,"location":2,"content":"So, some examples of tagging tasks are"},{"from":3566.32,"to":3569.26,"location":2,"content":"part-of-speech tagging and named entity recognition."},{"from":3569.26,"to":3572.59,"location":2,"content":"So, pictured here is part-of-speech tagging, and this is the task."},{"from":3572.59,"to":3575.24,"location":2,"content":"We have some kind of input text such as, uh,"},{"from":3575.24,"to":3577.64,"location":2,"content":"the startled cat knocked over the vase,"},{"from":3577.64,"to":3579.39,"location":2,"content":"and your job is to, uh,"},{"from":3579.39,"to":3582.09,"location":2,"content":"label or tag each word with its part of speech."},{"from":3582.09,"to":3585.16,"location":2,"content":"So, for example, cat is a noun and knocked is a verb."},{"from":3585.16,"to":3588.2,"location":2,"content":"So, you can use an RNN to do this task in,"},{"from":3588.2,"to":3590.35,"location":2,"content":"in the way that we've pictured, which is that you, uh,"},{"from":3590.35,"to":3592.72,"location":2,"content":"feed the text into the RNN, [NOISE] and then,"},{"from":3592.72,"to":3593.91,"location":2,"content":"on each step of the RNN,"},{"from":3593.91,"to":3595.7,"location":2,"content":"you, uh, have an output,"},{"from":3595.7,"to":3597.79,"location":2,"content":"probably a distribution over what, uh,"},{"from":3597.79,"to":3601.78,"location":2,"content":"tag you think it is, and then, uh, you can tag it in that way."},{"from":3601.78,"to":3604.05,"location":2,"content":"And then, also for named entity recognition,"},{"from":3604.05,"to":3605.19,"location":2,"content":"that's all about, um,"},{"from":3605.19,"to":3608.09,"location":2,"content":"tagging each of the words with what named entity type they are."},{"from":3608.09,"to":3611.82,"location":2,"content":"So, you do it in the same way. 
[NOISE] Okay."},{"from":3611.82,"to":3613.47,"location":2,"content":"Here's another thing you can use RNNs for,"},{"from":3613.47,"to":3616.2,"location":2,"content":"uh, you can use them for sentence classification."},{"from":3616.2,"to":3619.08,"location":2,"content":"So, sentence classification is just a general term to mean"},{"from":3619.08,"to":3622.17,"location":2,"content":"any kind of task where you want to take a sentence or other piece of text,"},{"from":3622.17,"to":3624.95,"location":2,"content":"and then, you want to classify it into one of several classes."},{"from":3624.95,"to":3628.12,"location":2,"content":"So, an example of that is sentiment classification."},{"from":3628.12,"to":3630.4,"location":2,"content":"Uh, sentiment classification is when you have some kind"},{"from":3630.4,"to":3632.68,"location":2,"content":"of input text such as, let's say, overall,"},{"from":3632.68,"to":3634.51,"location":2,"content":"I enjoyed the movie a lot, and then,"},{"from":3634.51,"to":3635.77,"location":2,"content":"you're trying to classify that as being"},{"from":3635.77,"to":3638.09,"location":2,"content":"positive or negative or [NOISE] neutral sentiment."},{"from":3638.09,"to":3640.09,"location":2,"content":"So, in this example, this is positive sentiment."},{"from":3640.09,"to":3645.4,"location":2,"content":"[NOISE] So, one way you might use an RNN to tackle this task is, uh,"},{"from":3645.4,"to":3649.45,"location":2,"content":"you might encode the text using the RNN, and then,"},{"from":3649.45,"to":3653.35,"location":2,"content":"really what you want is some kind of sentence encoding so that you"},{"from":3653.35,"to":3657.26,"location":2,"content":"can output your label for the sentence, right?"},{"from":3657.26,"to":3659.68,"location":2,"content":"And it'd be useful if you had a single vector to"},{"from":3659.68,"to":3662.97,"location":2,"content":"represent the sentence rather than all of these separate vectors."},{"from":3662.97,"to":3664.87,"location":2,"content":"So, how would you do this?"},{"from":3664.87,"to":3667,"location":2,"content":"How would you get the sentence encoding from the RNN?"},{"from":3667,"to":3670.54,"location":2,"content":"[NOISE] Uh, one thing you could do [NOISE] is,"},{"from":3670.54,"to":3674.29,"location":2,"content":"you could use the final hidden state as your sentence encoding."},{"from":3674.29,"to":3678.46,"location":2,"content":"So, um, the reason why you might think this is a good idea is because,"},{"from":3678.46,"to":3679.81,"location":2,"content":"for example, in the RNN,"},{"from":3679.81,"to":3682.68,"location":2,"content":"we regard the, the final hidden state as,"},{"from":3682.68,"to":3685.74,"location":2,"content":"um, the thing you use to predict what's coming next, right?"},{"from":3685.74,"to":3688.3,"location":2,"content":"So, we're assuming that the final hidden state contains"},{"from":3688.3,"to":3691.47,"location":2,"content":"information about all of the text that has come so far, right?"},{"from":3691.47,"to":3694.99,"location":2,"content":"So, for that reason, you might suppose that this is a good sentence encoding,"},{"from":3694.99,"to":3696.46,"location":2,"content":"and we could use that [NOISE] to predict, you know,"},{"from":3696.46,"to":3699.04,"location":2,"content":"what, uh, what sentiment is this sentence."},{"from":3699.04,"to":3701.35,"location":2,"content":"And it turns out that usually, a better way to do this,"},{"from":3701.35,"to":3702.59,"location":2,"content":"usually a more 
effective way,"},{"from":3702.59,"to":3706.24,"location":2,"content":"is to do something like maybe take an element-wise max or"},{"from":3706.24,"to":3710.08,"location":2,"content":"an element-wise mean of all these hidden states to get your sentence encoding,"},{"from":3710.08,"to":3712.34,"location":2,"content":"um, [NOISE] and, uh,"},{"from":3712.34,"to":3714.64,"location":2,"content":"this tends to work better than just using the final hidden state."},{"from":3714.64,"to":3719.31,"location":2,"content":"[NOISE] Uh, there are some other more advanced things you can do as well."},{"from":3719.31,"to":3722.22,"location":2,"content":"Okay. [NOISE] Another thing that you can use RNNs for"},{"from":3722.22,"to":3725.34,"location":2,"content":"is as a general purpose encoder module."},{"from":3725.34,"to":3728.47,"location":2,"content":"Uh, so, here's an example that's question answering,"},{"from":3728.47,"to":3730.48,"location":2,"content":"but really this idea of RNNs as"},{"from":3730.48,"to":3735.09,"location":2,"content":"a general purpose encoder module is very common [NOISE] and used in lots of different,"},{"from":3735.09,"to":3737.59,"location":2,"content":"um, deep learning [NOISE] architectures for NLP."},{"from":3737.59,"to":3741.18,"location":2,"content":"[NOISE] So, here's an example which is question answering."},{"from":3741.18,"to":3743.41,"location":2,"content":"Uh, so, let's suppose that the, the task is,"},{"from":3743.41,"to":3744.67,"location":2,"content":"you've got some kind of context,"},{"from":3744.67,"to":3746.11,"location":2,"content":"which, in this, uh, situation,"},{"from":3746.11,"to":3749.36,"location":2,"content":"is the Wikipedia article on Beethoven, and then,"},{"from":3749.36,"to":3751.21,"location":2,"content":"you have a question which is asking,"},{"from":3751.21,"to":3753.07,"location":2,"content":"what nationality was Beethoven?"},{"from":3753.07,"to":3756.4,"location":2,"content":"Uh, and this is actually taken from the SQuAD Challenge,"},{"from":3756.4,"to":3758.68,"location":2,"content":"which is the subject of the Default Final Project."},{"from":3758.68,"to":3761.77,"location":2,"content":"So, um, if you choose to do- to do the Default Final Project,"},{"from":3761.77,"to":3764.95,"location":2,"content":"you're going to be building systems that solve this problem."},{"from":3764.95,"to":3769.93,"location":2,"content":"So, what you might do is, you might use an RNN to process the question,"},{"from":3769.93,"to":3771.97,"location":2,"content":"what nationality was [NOISE] Beethoven?"},{"from":3771.97,"to":3776.22,"location":2,"content":"And then, you might use those hidden states that you get from this, uh,"},{"from":3776.22,"to":3780.28,"location":2,"content":"RNN over the question as a representation of the question."},{"from":3780.28,"to":3783.58,"location":2,"content":"And I'm being intentionally vague here [NOISE] about what might happen next, uh,"},{"from":3783.58,"to":3785.2,"location":2,"content":"but the idea is that [NOISE]"},{"from":3785.2,"to":3788.5,"location":2,"content":"both the context and the question are going to be fed in some way,"},{"from":3788.5,"to":3790.9,"location":2,"content":"and maybe you'll use an RNN on the context as well,"},{"from":3790.9,"to":3794.49,"location":2,"content":"and you're going to have lots more neural architecture in order to get your answer,"},{"from":3794.49,"to":3795.89,"location":2,"content":"which is, uh, German."},{"from":3795.89,"to":3801.36,"location":2,"content":"So, the point here is 
that the RNN is acting as an encoder for the question,"},{"from":3801.36,"to":3803.92,"location":2,"content":"that is, the hidden states that you get from running"},{"from":3803.92,"to":3806.65,"location":2,"content":"the RNN over the question, represent the question."},{"from":3806.65,"to":3811.81,"location":2,"content":"[NOISE] Uh, so, the encoder is part of a larger neural system,"},{"from":3811.81,"to":3813.94,"location":2,"content":"[NOISE] and it's the, the hidden states themselves"},{"from":3813.94,"to":3816.3,"location":2,"content":"that you're interested in because they contain the information."},{"from":3816.3,"to":3818.14,"location":2,"content":"So, you could have, um, taken,"},{"from":3818.14,"to":3819.7,"location":2,"content":"uh, element-wise max or mean,"},{"from":3819.7,"to":3821.01,"location":2,"content":"like we showed in the previous slide,"},{"from":3821.01,"to":3824.17,"location":2,"content":"to get a single vector for the question, but often, you don't do that."},{"from":3824.17,"to":3828.16,"location":2,"content":"Often, you'll, uh, do something else which uses the hidden states directly."},{"from":3828.16,"to":3833.44,"location":2,"content":"So, the general point here is that RNNs are quite powerful as a way to represent,"},{"from":3833.44,"to":3834.93,"location":2,"content":"uh, a sequence of text,"},{"from":3834.93,"to":3838.17,"location":2,"content":"uh, for further computation."},{"from":3838.17,"to":3842.93,"location":2,"content":"Okay. Last example. So, going back to RNN language models again, [NOISE] uh,"},{"from":3842.93,"to":3844.57,"location":2,"content":"they can be used to generate text,"},{"from":3844.57,"to":3847.3,"location":2,"content":"and there are lots of different, uh, applications for this."},{"from":3847.3,"to":3851.02,"location":2,"content":"So, for example, speech recognition, uh, you will have your input,"},{"from":3851.02,"to":3853.34,"location":2,"content":"which is the audio, and as a student asked earlier,"},{"from":3853.34,"to":3855.86,"location":2,"content":"this will be, uh, represented in some way,"},{"from":3855.86,"to":3859.48,"location":2,"content":"and then, uh, maybe you'll do a neural encoding of that, [NOISE] and then,"},{"from":3859.48,"to":3862.61,"location":2,"content":"you use your RNN language model to generate the output,"},{"from":3862.61,"to":3864.35,"location":2,"content":"which, in this case, is going to be a transcription"},{"from":3864.35,"to":3866.28,"location":2,"content":"of what the audio recording is saying."},{"from":3866.28,"to":3868.03,"location":2,"content":"So, you will have some way of conditioning,"},{"from":3868.03,"to":3869.83,"location":2,"content":"and we're gonna talk more about how this works, uh,"},{"from":3869.83,"to":3871.78,"location":2,"content":"in a later lecture, but you have some way of"},{"from":3871.78,"to":3875.23,"location":2,"content":"conditioning your RNN language model on the input."},{"from":3875.23,"to":3878.92,"location":2,"content":"So, you'll use that to generate your text, [NOISE] and in this case,"},{"from":3878.92,"to":3881.34,"location":2,"content":"the utterance might be something like, what's the weather,"},{"from":3881.34,"to":3884.59,"location":2,"content":"question mark. [OVERLAPPING] [NOISE]"},{"from":3884.59,"to":3894.22,"location":2,"content":"Yeah. [NOISE]"},{"from":3894.22,"to":3898.12,"location":2,"content":"In speech recognition, [inaudible]."},{"from":3898.12,"to":3900.1,"location":2,"content":"Okay. 
So, the question is, in speech recognition,"},{"from":3900.1,"to":3902.76,"location":2,"content":"we often use word error rates to evaluate,"},{"from":3902.76,"to":3904.69,"location":2,"content":"but would you use perplexity to evaluate?"},{"from":3904.69,"to":3907.69,"location":2,"content":"[NOISE] Um, I don't actually know much about that. Do you know, Chris,"},{"from":3907.69,"to":3909.25,"location":2,"content":"what they use in, uh,"},{"from":3909.25,"to":3915.01,"location":2,"content":"speech recognition as an eval metric? [NOISE]"},{"from":3915.01,"to":3923.59,"location":2,"content":"[inaudible] word error rate [inaudible]."},{"from":3923.59,"to":3925.38,"location":2,"content":"The answer is, you often use WER,"},{"from":3925.38,"to":3927.55,"location":2,"content":"uh, for eval, but you might also use perplexity."},{"from":3927.55,"to":3929.5,"location":2,"content":"Yeah. Any other questions?"},{"from":3929.5,"to":3935.57,"location":2,"content":"[NOISE] Okay. So, um,"},{"from":3935.57,"to":3938.35,"location":2,"content":"this is an example of a conditional language model,"},{"from":3938.35,"to":3939.97,"location":2,"content":"and it's called a conditional language model"},{"from":3939.97,"to":3941.72,"location":2,"content":"because we have the language model component,"},{"from":3941.72,"to":3944.74,"location":2,"content":"but crucially, we're conditioning it on some kind of input."},{"from":3944.74,"to":3948.58,"location":2,"content":"So, unlike the, uh, fun examples like with the Harry Potter text where we were just, uh,"},{"from":3948.58,"to":3951.46,"location":2,"content":"generating text basically unconditionally, you know,"},{"from":3951.46,"to":3952.75,"location":2,"content":"we trained it on the training data, and then,"},{"from":3952.75,"to":3954.82,"location":2,"content":"we just started [NOISE] with some kind of random seed,"},{"from":3954.82,"to":3956.3,"location":2,"content":"and then, it generates unconditionally."},{"from":3956.3,"to":3958.54,"location":2,"content":"This is called a conditional language model"},{"from":3958.54,"to":3961.61,"location":2,"content":"because there's some kind of input that we need to condition on."},{"from":3961.61,"to":3965.98,"location":2,"content":"Uh, machine translation is an example [NOISE] also of a conditional language model,"},{"from":3965.98,"to":3967.78,"location":2,"content":"and we're going to see that in much more detail in"},{"from":3967.78,"to":3969.52,"location":2,"content":"the lecture next week on machine translation."},{"from":3969.52,"to":3972.89,"location":2,"content":"[NOISE] All right. 
Are there any more questions?"},{"from":3972.89,"to":3974.32,"location":2,"content":"You have a bit of extra time, I think."},{"from":3974.32,"to":3977.66,"location":2,"content":"[NOISE] Yeah."},{"from":3977.66,"to":3980.35,"location":2,"content":"I have a question about RNNs in general."},{"from":3980.35,"to":3985.34,"location":2,"content":"[NOISE] Do people ever combine the RNN,"},{"from":3985.34,"to":3987.22,"location":2,"content":"uh, patterns of architecture,"},{"from":3987.22,"to":3989.97,"location":2,"content":"um, with other neural networks?"},{"from":3989.97,"to":3991.89,"location":2,"content":"Say, [NOISE] you have, um, you know,"},{"from":3991.89,"to":3994.28,"location":2,"content":"N previous layers that could be doing anything,"},{"from":3994.28,"to":3995.41,"location":2,"content":"and at the end of your network,"},{"from":3995.41,"to":3996.88,"location":2,"content":"you wanna run them through,"},{"from":3996.88,"to":3999.16,"location":2,"content":"uh, five recurrent layers."},{"from":3999.16,"to":4000.81,"location":2,"content":"Do people mix and match like that,"},{"from":4000.81,"to":4002.19,"location":2,"content":"or these, uh, [inaudible]. [NOISE]"},{"from":4002.19,"to":4006.09,"location":2,"content":"Uh, the question is,"},{"from":4006.09,"to":4008.58,"location":2,"content":"do you ever combine RNNs with other types of architecture?"},{"from":4008.58,"to":4009.87,"location":2,"content":"So, I think the answer is yes."},{"from":4009.87,"to":4011.59,"location":2,"content":"[NOISE] Uh, you might, [NOISE] you know, uh,"},{"from":4011.59,"to":4015.21,"location":2,"content":"have- you might have other types of architectures, uh,"},{"from":4015.21,"to":4018.54,"location":2,"content":"to produce the vectors that are going to be the input to the RNN,"},{"from":4018.54,"to":4020.28,"location":2,"content":"or you might use the output of your RNN"},{"from":4020.28,"to":4026.39,"location":2,"content":"[NOISE] and feed that into a different type of neural network."},{"from":4026.39,"to":4028.62,"location":2,"content":"So, yes. 
[NOISE] Any other questions?"},{"from":4028.62,"to":4031.82,"location":2,"content":"[NOISE] Okay."},{"from":4031.82,"to":4035.51,"location":2,"content":"Uh, so, before we finish, uh, I have a note on terminology."},{"from":4035.51,"to":4037.49,"location":2,"content":"Uh, when you're reading papers,"},{"from":4037.49,"to":4040.91,"location":2,"content":"you might find often this phrase vanilla RNN,"},{"from":4040.91,"to":4043.07,"location":2,"content":"and when you see the phrase vanilla RNN,"},{"from":4043.07,"to":4044.53,"location":2,"content":"that usually means, uh,"},{"from":4044.53,"to":4046.91,"location":2,"content":"the RNNs that are described in this lecture."},{"from":4046.91,"to":4050.46,"location":2,"content":"So, the reason why those are called vanilla RNNs is"},{"from":4050.46,"to":4054.76,"location":2,"content":"because there are actually other more complex kinds of RNN flavors."},{"from":4054.76,"to":4058.01,"location":2,"content":"So, for example, there's GRU and LSTM,"},{"from":4058.01,"to":4060.33,"location":2,"content":"and we're gonna learn about both of those next week."},{"from":4060.33,"to":4062.61,"location":2,"content":"And another thing we're going to learn about next week"},{"from":4062.61,"to":4065.09,"location":2,"content":"[NOISE] is that you can actually get some multi-layer RNNs,"},{"from":4065.09,"to":4068.25,"location":2,"content":"which is when you stack multiple RNNs on top of each other."},{"from":4068.25,"to":4070.93,"location":2,"content":"[NOISE] So, uh, you're gonna learn about those,"},{"from":4070.93,"to":4073.88,"location":2,"content":"but we hope that by the time you reach the end of this course,"},{"from":4073.88,"to":4076.91,"location":2,"content":"you're going to be able to read a research paper and see a phrase like"},{"from":4076.91,"to":4081.15,"location":2,"content":"stacked bidirectional LSTM with residual connections and self-attention,"},{"from":4081.15,"to":4082.68,"location":2,"content":"and you'll know exactly what that is."},{"from":4082.68,"to":4084.84,"location":2,"content":"[NOISE] That's just an RNN with all of the toppings."},{"from":4084.84,"to":4087.84,"location":2,"content":"[LAUGHTER] All right. Thank you. That's it for today."},{"from":4087.84,"to":4095.91,"location":2,"content":"[NOISE] Uh, next time- [APPLAUSE] next time,"},{"from":4095.91,"to":4098.34,"location":2,"content":"we're learning about problems [NOISE] and fancy RNNs."},{"from":4098.34,"to":4104.77,"location":2,"content":"[NOISE]"}]}
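
To make the perplexity discussion from the lecture concrete, here is a minimal Python sketch. It is my own illustration, not material from the course: the `token_probs` values are hypothetical stand-ins for the model's per-word conditional probabilities P(x_{t+1} | x_1, ..., x_t). It computes perplexity as the normalized inverse probability of the corpus and checks that this equals the exponential of the average cross-entropy loss.

```python
# Sketch only: made-up per-word probabilities standing in for a language model's outputs.
import math

token_probs = [0.20, 0.05, 0.40, 0.10]  # hypothetical P(word | history) for each word
T = len(token_probs)

# Definition: perplexity = (product over t of 1 / p_t) ** (1 / T)
perplexity = math.prod(1.0 / p for p in token_probs) ** (1.0 / T)

# Equivalent form: exp of the average negative log-probability (the cross-entropy loss J(theta))
cross_entropy = -sum(math.log(p) for p in token_probs) / T
assert math.isclose(perplexity, math.exp(cross_entropy))

print(perplexity)  # lower is better: higher probability on the corpus means lower perplexity
```

So minimizing the cross-entropy loss during training and reporting perplexity at evaluation time are, up to the exponential, the same objective.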
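
The tagging use of RNNs described in the lecture (feed the text in, and at each step output a distribution over tags) can be sketched roughly as follows. This is an assumed, simplified vanilla-RNN forward pass in numpy, not the course's actual code; all dimensions, the random weights, and names like `W_xh`, `W_hh`, `W_hy` are hypothetical.

```python
# Sketch of a vanilla RNN used as a tagger: the same weights are applied at every step,
# and each hidden state is mapped to a probability distribution over tags.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, n_tags, seq_len = 50, 64, 10, 7   # assumed sizes for illustration

W_xh = rng.normal(scale=0.1, size=(d_h, d_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden (reused at every step)
b_h  = np.zeros(d_h)
W_hy = rng.normal(scale=0.1, size=(n_tags, d_h)) # hidden -> tag scores
b_y  = np.zeros(n_tags)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x = rng.normal(size=(seq_len, d_in))  # stand-in word vectors, e.g. for "the startled cat knocked over the vase"

h = np.zeros(d_h)
tag_distributions = []
for t in range(seq_len):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)          # same weights on every step
    tag_distributions.append(softmax(W_hy @ h + b_y))  # per-word distribution over tags

predicted_tags = [int(np.argmax(p)) for p in tag_distributions]
print(predicted_tags)
```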
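
And for the sentence-classification part of the lecture, here is a small follow-on sketch, again my own illustration under assumed shapes, of the two sentence-encoding choices mentioned: using the final hidden state versus taking an element-wise mean or max over all of the hidden states.

```python
# Sketch: turning a sequence of RNN hidden states into a single sentence vector.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_h = 8, 64
hidden_states = rng.normal(size=(seq_len, d_h))   # stand-in for the RNN's hidden states

sentence_enc_final = hidden_states[-1]             # option 1: final hidden state
sentence_enc_mean  = hidden_states.mean(axis=0)    # option 2a: element-wise mean
sentence_enc_max   = hidden_states.max(axis=0)     # option 2b: element-wise max

# Any of these single vectors could then go to a classifier head, e.g. a softmax over
# {positive, negative, neutral} for sentiment classification.
print(sentence_enc_final.shape, sentence_enc_mean.shape, sentence_enc_max.shape)
```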