{"font_size":0.4,"font_color":"#FFFFFF","background_alpha":0.5,"background_color":"#9C27B0","Stroke":"none","body":[{"from":5.66,"to":8.14,"location":2,"content":"Okay hi everyone."},{"from":8.14,"to":11.21,"location":2,"content":"Let's get started again."},{"from":11.21,"to":17.01,"location":2,"content":"Um. Okay. So, first of all for a couple of announcements."},{"from":17.01,"to":20.04,"location":2,"content":"Um, first of all thanks to everyone, um,"},{"from":20.04,"to":23.95,"location":2,"content":"who filled in our mid-quarter survey we've actually gotten,"},{"from":23.95,"to":27.32,"location":2,"content":"um, great participation in that."},{"from":27.32,"to":29.86,"location":2,"content":"Here are my two little Pac-Man figures."},{"from":29.86,"to":31.64,"location":2,"content":"So, the Pac-Man figures thinks,"},{"from":31.64,"to":34.58,"location":2,"content":"means that almost everyone thinks the lectures are at"},{"from":34.58,"to":38.39,"location":2,"content":"the right pace and those that don't are pretty much evenly divided."},{"from":38.39,"to":42.68,"location":2,"content":"Um, if we go for how challenging was Assignment three,"},{"from":42.68,"to":46.19,"location":2,"content":"slightly more people thought it was too easy than too hard."},{"from":46.19,"to":48.68,"location":2,"content":"So, I guess we're setting about rectifying that with"},{"from":48.68,"to":53.46,"location":2,"content":"assignments four and five, um, [NOISE]."},{"from":53.46,"to":56.03,"location":2,"content":"So, though there are a whole bunch of other questions and we've"},{"from":56.03,"to":59.18,"location":2,"content":"been trying to absorb all the feedback."},{"from":59.18,"to":63.98,"location":2,"content":"I mean one of the questions was what people wanted most from the remaining lectures."},{"from":63.98,"to":69.26,"location":2,"content":"I guess the good news here is really we're very good at predicting, um,"},{"from":69.26,"to":71.78,"location":2,"content":"what people wanted, that or else everybody"},{"from":71.78,"to":74.66,"location":2,"content":"just looked ahead in the syllabus and wrote down what it said was"},{"from":74.66,"to":79.16,"location":2,"content":"ahead in the syllabus but I guess the most popular four answers to"},{"from":79.16,"to":83.81,"location":2,"content":"topics that they wanted in the remaining lectures were Transformers and BERT,"},{"from":83.81,"to":85.86,"location":2,"content":"both of which are gonna be covered this week."},{"from":85.86,"to":89.78,"location":2,"content":"Uh, question-answering which we talked about last week, um,"},{"from":89.78,"to":93.23,"location":2,"content":"and then text generation and summarization"},{"from":93.23,"to":98.28,"location":2,"content":"and you guys get Abby back next week to talk about that."},{"from":98.28,"to":102.38,"location":2,"content":"Um, there are also a lot of people also answered this question"},{"from":102.38,"to":106.4,"location":2,"content":"a different way as to what kind of style of stuff,"},{"from":106.4,"to":110.96,"location":2,"content":"um, some people emphasized new research and the latest updates from the field."},{"from":110.96,"to":113.3,"location":2,"content":"I guess we'll get some of that today as well,"},{"from":113.3,"to":115.23,"location":2,"content":"some people are more interested in"},{"from":115.23,"to":120,"location":2,"content":"successful applications in industry or trying to do a bit of that,"},{"from":120,"to":122.84,"location":2,"content":"um, cool new neural architectures."},{"from":122.84,"to":125.66,"location":2,"content":"Um, the bottom answer wasn't the most popular one,"},{"from":125.66,"to":128.39,"location":2,"content":"I'll admit but at least a few people, um,"},{"from":128.39,"to":131.81,"location":2,"content":"wish that we were teaching more linguistic stuff."},{"from":131.81,"to":134.3,"location":2,"content":"Um, I mean that is something that I actually feel"},{"from":134.3,"to":138.58,"location":2,"content":"a bit awkward about the way things were merged with CS224N,"},{"from":138.58,"to":140.1,"location":2,"content":"with this deep learning,"},{"from":140.1,"to":142.37,"location":2,"content":"I mean the truth of the matter is that sort of seems"},{"from":142.37,"to":144.85,"location":2,"content":"like in the early part of the course,"},{"from":144.85,"to":147.21,"location":2,"content":"there's so much to cover with,"},{"from":147.21,"to":149.66,"location":2,"content":"um, neural networks, backpropagation,"},{"from":149.66,"to":154.25,"location":2,"content":"different, um, neural net architectures and so on that the reality is that we"},{"from":154.25,"to":159.89,"location":2,"content":"teach rather less linguistic stuff than we used to in the class."},{"from":159.89,"to":163.16,"location":2,"content":"I mean, for the last four weeks of the class we really do try and"},{"from":163.16,"to":166.73,"location":2,"content":"cover some more linguistic stuff topics."},{"from":166.73,"to":169.25,"location":2,"content":"Um, so look forward to that."},{"from":169.25,"to":171.25,"location":2,"content":"Um, announcements."},{"from":171.25,"to":174.37,"location":2,"content":"Okay. So we've made a couple of deadline changes."},{"from":174.37,"to":177.41,"location":2,"content":"Um, firstly, a number of people have"},{"from":177.41,"to":180.92,"location":2,"content":"mentioned that they think assignment five is a bit tough."},{"from":180.92,"to":184.16,"location":2,"content":"And so, we're giving people one extra day,"},{"from":184.16,"to":186.1,"location":2,"content":"um, to do assignment five."},{"from":186.1,"to":189.83,"location":2,"content":"Um, I'm realizing in one sense that one extra day is not a ton"},{"from":189.83,"to":193.76,"location":2,"content":"but you know there's sort of this complex balance here because on the other hand,"},{"from":193.76,"to":198.69,"location":2,"content":"we don't really want to undermine time that people have available for final projects."},{"from":198.69,"to":203.01,"location":2,"content":"And if you're one of the people who hasn't yet started assignment five,"},{"from":203.01,"to":206.07,"location":2,"content":"um, we do really encourage you to get underway on it."},{"from":206.07,"to":209.96,"location":2,"content":"Um, yeah, in the reverse direction"},{"from":209.96,"to":214.16,"location":2,"content":"we decided that the project milestone was really too late."},{"from":214.16,"to":218.06,"location":2,"content":"If we are going to be able to give you feedback on it that you could usefully make use"},{"from":218.06,"to":222.08,"location":2,"content":"of, so we're moving the project milestone date two days earlier."},{"from":222.08,"to":225.94,"location":2,"content":"And so, we've also gotten everyone's project proposals and our"},{"from":225.94,"to":230.27,"location":2,"content":"planned hope is to get them back to everybody on Friday."},{"from":230.27,"to":232.31,"location":2,"content":"Yes, so, a lot of things moving."},{"from":232.31,"to":236.97,"location":2,"content":"Um, and finally on other announcements I guess, um, on"},{"from":236.97,"to":241.64,"location":2,"content":"this Thursday is our first invited speaker, um, and so,"},{"from":241.64,"to":245.16,"location":2,"content":"if you're in person student you're meant to be here,"},{"from":245.16,"to":248.93,"location":2,"content":"um, and if you're not able to be here,"},{"from":248.93,"to":252.35,"location":2,"content":"you should know about our reaction paragraph policy and"},{"from":252.35,"to":256.46,"location":2,"content":"I actually stuck up on the Piazza pinned posts about, um,"},{"from":256.46,"to":261.38,"location":2,"content":"reaction pieces and attendance, an example of a reaction piece, um,"},{"from":261.38,"to":267.32,"location":2,"content":"from a past class to make it a little bit more concrete what's expected there."},{"from":267.32,"to":271.85,"location":2,"content":"But, you know, the idea is what we're hoping for something that isn't a ton of work."},{"from":271.85,"to":275.86,"location":2,"content":"You can just write 100, 150 words, a few sentences,"},{"from":275.86,"to":280.04,"location":2,"content":"but wanting you to pick out a specific thing that was"},{"from":280.04,"to":282.41,"location":2,"content":"interesting and write a couple of sentences"},{"from":282.41,"to":285.14,"location":2,"content":"about what it was and what your thoughts are about it."},{"from":285.14,"to":290.27,"location":2,"content":"I, not just some very generic statement of this was a lecture about transformers."},{"from":290.27,"to":292.81,"location":2,"content":"He talked about transformers and it was interesting,"},{"from":292.81,"to":298.63,"location":2,"content":"that is not what we want for the reaction piece. Um, okay."},{"from":298.63,"to":301.14,"location":2,"content":"So, here's the plan for today."},{"from":301.14,"to":304.95,"location":2,"content":"So, for today's, what I want to talk about is,"},{"from":304.95,"to":309.79,"location":2,"content":"um, the exciting recent work about contextual word representations."},{"from":309.79,"to":315.62,"location":2,"content":"I mean I, I was thinking of what I was gonna say I was wanting to say, oh, this is"},{"from":315.62,"to":318.77,"location":2,"content":"the most exciting thing in deep learning for NLP in"},{"from":318.77,"to":322.04,"location":2,"content":"the last five years then something's just completely wrong,"},{"from":322.04,"to":327.08,"location":2,"content":"because really this is the most exciting thing in deep learning that happened in 2018."},{"from":327.08,"to":329.96,"location":2,"content":"I mean, I guess things move very quickly, um,"},{"from":329.96,"to":333.53,"location":2,"content":"in deep learning at the moment and it's sort of I don't think it's"},{"from":333.53,"to":338.13,"location":2,"content":"really fair to say that you know it's got 5 years of life."},{"from":338.13,"to":340.49,"location":2,"content":"But there's a very exciting thing that happened last year,"},{"from":340.49,"to":342.65,"location":2,"content":"and we'll talk about that."},{"from":342.65,"to":345.72,"location":2,"content":"Okay. So, we'll talk about early stuff,"},{"from":345.72,"to":347.45,"location":2,"content":"the ELMo, ULMfit,"},{"from":347.45,"to":350.63,"location":2,"content":"transformer architectures briefly and then go on to"},{"from":350.63,"to":355.67,"location":2,"content":"talk about the BERT model that's being quite prominent lately."},{"from":355.67,"to":358.21,"location":2,"content":"So, let's just recap,"},{"from":358.21,"to":362.61,"location":2,"content":"let's just go backwards a bit first to think about, um,"},{"from":362.61,"to":368.07,"location":2,"content":"where we've been and where we are now and why we might want something more."},{"from":368.07,"to":369.69,"location":2,"content":"So, up until now,"},{"from":369.69,"to":371.06,"location":2,"content":"we've sort of just had,"},{"from":371.06,"to":376.25,"location":2,"content":"one representation for words which is what we learned at the beginning of class,"},{"from":376.25,"to":382.08,"location":2,"content":"there was a word, you trained a word vector for it and that's what you used in your model."},{"from":382.08,"to":384.77,"location":2,"content":"Um, and you could do that, with algorithms like Word2vec,"},{"from":384.77,"to":388.07,"location":2,"content":"GloVe, or fastText that I mentioned last week."},{"from":388.07,"to":394.46,"location":2,"content":"Um, so some on this sort of progression of ideas in deep learning,"},{"from":394.46,"to":399.05,"location":2,"content":"when deep learning for NLP or the general"},{"from":399.05,"to":402.06,"location":2,"content":"just the resurgence of neural networks for NLP"},{"from":402.06,"to":405.62,"location":2,"content":"came about sort of at the beginning of this decade."},{"from":405.62,"to":410.64,"location":2,"content":"Um, these pre-trained word vectors."},{"from":410.64,"to":414.79,"location":2,"content":"So, pre-trained unsupervised over a large amount of text."},{"from":414.79,"to":418.27,"location":2,"content":"They were completely seen as the secret sauce,"},{"from":418.27,"to":420.81,"location":2,"content":"and they were the thing that transformed"},{"from":420.81,"to":424.8,"location":2,"content":"neural networks from NLP to something that didn't really work,"},{"from":424.8,"to":426.65,"location":2,"content":"to something that worked great."},{"from":426.65,"to":429.91,"location":2,"content":"Um, so, this is actually an old slide of mine."},{"from":429.91,"to":432.67,"location":2,"content":"So, this is a slide I guess I first made for"},{"from":432.67,"to":438.49,"location":2,"content":"2012 ACL tutorial and then sort of used in lectures."},{"from":438.49,"to":443,"location":2,"content":"Sort of in 2013, 2014. Um-."},{"from":443,"to":446.46,"location":2,"content":"And so this was sort of the picture in those years."},{"from":446.46,"to":448,"location":2,"content":"So this was looking at two tasks,"},{"from":448,"to":453.12,"location":2,"content":"part of speech tagging and named entity recognition which I'll use quite a bit today."},{"from":453.12,"to":458.27,"location":2,"content":"And, you know, the top line was showing a state of the art which was"},{"from":458.27,"to":462.78,"location":2,"content":"a traditional categorical feature based classifier of the kind"},{"from":462.78,"to":467.44,"location":2,"content":"that dominated NLP in the 2000s decade, in their performance."},{"from":467.44,"to":473.21,"location":2,"content":"And what then the next line showed is that if you took the same data set"},{"from":473.21,"to":479.51,"location":2,"content":"and you trained a supervised neural network on it and said how good is your performance?"},{"from":479.51,"to":481.85,"location":2,"content":"Um, the story was, it wasn't great."},{"from":481.85,"to":486.72,"location":2,"content":"Um, part-of-speech tagging has very high numbers always for various reasons."},{"from":486.72,"to":491.39,"location":2,"content":"So perhaps the more indicative one to look at is these named entity recognition numbers."},{"from":491.39,"to":494.53,"location":2,"content":"So, you know, this was sort of neural net sucked, right?"},{"from":494.53,"to":498.31,"location":2,"content":"The reason why last decade everybody used, um,"},{"from":498.31,"to":500.79,"location":2,"content":"categorical feature based, you know,"},{"from":500.79,"to":503.34,"location":2,"content":"CRF, SVM kind of classifiers."},{"from":503.34,"to":507,"location":2,"content":"Well, if you look, it worked eight percent better than a neural network."},{"from":507,"to":508.33,"location":2,"content":"Why wouldn't anybody?"},{"from":508.33,"to":512.85,"location":2,"content":"But then what had happened was people had come up with this idea that we could"},{"from":512.85,"to":517.51,"location":2,"content":"do unsupervised pre-training of word representations,"},{"from":517.51,"to":520.77,"location":2,"content":"um, to come up with word vectors for words."},{"from":520.77,"to":522.06,"location":2,"content":"And, you know, in those days,"},{"from":522.06,"to":525.62,"location":2,"content":"this was very hard to do the alg- both because of"},{"from":525.62,"to":529.5,"location":2,"content":"the kind of algorithms and the kind of machines that were available, right?"},{"from":529.5,"to":531.99,"location":2,"content":"So Collobert and Weston, 2011,"},{"from":531.99,"to":536.98,"location":2,"content":"spent seven weeks training their unsupervised word representations."},{"from":536.98,"to":538.11,"location":2,"content":"And at the end of the day,"},{"from":538.11,"to":541.79,"location":2,"content":"there are only 100 dimensional, um, word representations."},{"from":541.79,"to":543.87,"location":2,"content":"But this was the miracle breakthrough, right?"},{"from":543.87,"to":548.97,"location":2,"content":"You've put in this miracle breakthrough of unsupervised word representations."},{"from":548.97,"to":552.22,"location":2,"content":"And now, the neural net is getting to 88.87."},{"from":552.22,"to":555.38,"location":2,"content":"So it's almost as good as the feature-based classifier,"},{"from":555.38,"to":557.51,"location":2,"content":"and then like any good engineers,"},{"from":557.51,"to":559.5,"location":2,"content":"they did some hacking with some extra features,"},{"from":559.5,"to":561.17,"location":2,"content":"because they had some stuff like that."},{"from":561.17,"to":566.72,"location":2,"content":"And they got a system that was then slightly better than the feature based system."},{"from":566.72,"to":569.52,"location":2,"content":"Okay. So that was sort of our picture that,"},{"from":569.52,"to":573.18,"location":2,"content":"um, having these pre-trained,"},{"from":573.18,"to":576.56,"location":2,"content":"unsuper- and unsupervised manner of word representations,"},{"from":576.56,"to":579,"location":2,"content":"that was sort of the big breakthrough and"},{"from":579,"to":582.28,"location":2,"content":"the secret sauce that gave all the oomph that made,"},{"from":582.28,"to":584.76,"location":2,"content":"um, neural networks competitive."},{"from":584.76,"to":586.2,"location":2,"content":"Um, but, you know,"},{"from":586.2,"to":591.57,"location":2,"content":"it's a sort of a funny thing happened which was after people had sort of had"},{"from":591.57,"to":594.43,"location":2,"content":"some of these initial breakthroughs which were"},{"from":594.43,"to":597.51,"location":2,"content":"all about unsupervised methods for pre-training,"},{"from":597.51,"to":599.03,"location":2,"content":"it was the same in vision."},{"from":599.03,"to":600.63,"location":2,"content":"This was the era in vision,"},{"from":600.63,"to":603.38,"location":2,"content":"where you were building restricted Boltzmann machines and doing"},{"from":603.38,"to":607.41,"location":2,"content":"complicated unsupervised pre-training techniques on them as well."},{"from":607.41,"to":613.42,"location":2,"content":"Some- somehow, after people had kind of discovered that and started to get good on it,"},{"from":613.42,"to":616.26,"location":2,"content":"people sort of started to discover, well,"},{"from":616.26,"to":620.28,"location":2,"content":"actually we have some new technologies for non-linearities,"},{"from":620.28,"to":622.66,"location":2,"content":"regularization, and things like that."},{"from":622.66,"to":625.83,"location":2,"content":"And if we keep using those same technologies,"},{"from":625.83,"to":629.77,"location":2,"content":"we can just go back to good old supervised learning."},{"from":629.77,"to":634.51,"location":2,"content":"And shockingly, it works way better now inside neural networks."},{"from":634.51,"to":637.44,"location":2,"content":"And so if you sort of go ahead to what I will call,"},{"from":637.44,"to":643.62,"location":2,"content":"sort of 2014 to 2018 picture,"},{"from":643.62,"to":646.45,"location":2,"content":"the, the picture is actually very different."},{"from":646.45,"to":648.27,"location":2,"content":"So the picture is, so this,"},{"from":648.27,"to":651.29,"location":2,"content":"the results I'm actually gonna show you this is from the Chen and Manning,"},{"from":651.29,"to":654.55,"location":2,"content":"um, neural dependency parser that we talked about weeks ago."},{"from":654.55,"to":656.7,"location":2,"content":"The picture there was, um,"},{"from":656.7,"to":659.07,"location":2,"content":"and you could- despite the fact that"},{"from":659.07,"to":663,"location":2,"content":"this dependency parser is being trained on a pretty small corpus,"},{"from":663,"to":665.64,"location":2,"content":"a million words of supervised data,"},{"from":665.64,"to":669.25,"location":2,"content":"you can just initialize it with random word vectors,"},{"from":669.25,"to":671.83,"location":2,"content":"um, and train a dependency parser."},{"from":671.83,"to":673.74,"location":2,"content":"And to a first approximation,"},{"from":673.74,"to":675.06,"location":2,"content":"it just works fine."},{"from":675.06,"to":677.93,"location":2,"content":"You get, get sort of a 90 percent accuracy,"},{"from":677.93,"to":680.42,"location":2,"content":"E- um, English dependency parser."},{"from":680.42,"to":683.74,"location":2,"content":"Now, it is the case that instead,"},{"from":683.74,"to":687.48,"location":2,"content":"you could use pre-trained word embeddings and you do a bit better."},{"from":687.48,"to":689.23,"location":2,"content":"You do about one percent better."},{"from":689.23,"to":691.55,"location":2,"content":"And so this was sort of the,"},{"from":691.55,"to":695.73,"location":2,"content":"the new world order which was yeah, um,"},{"from":695.73,"to":700.86,"location":2,"content":"these pre-trained unsupervised word embeddings are useful because you can"},{"from":700.86,"to":706.84,"location":2,"content":"train them from a lot more data and they can know about a much larger vocabulary."},{"from":706.84,"to":708.02,"location":2,"content":"That means they are useful."},{"from":708.02,"to":711.8,"location":2,"content":"They help with rare words and things like that and they give you a percent,"},{"from":711.8,"to":715.16,"location":2,"content":"but they're definitely no longer the sort of night and day,"},{"from":715.16,"to":719.91,"location":2,"content":"uh, thing to make neural networks work that we used to believe."},{"from":719.91,"to":724.47,"location":2,"content":"I'm, I'm just gonna deviate here to,"},{"from":724.47,"to":728.46,"location":2,"content":"from the main narrative to just sort of say, um,"},{"from":728.46,"to":733.49,"location":2,"content":"one more tip for dealing with unknown words with word vectors,"},{"from":733.49,"to":736.29,"location":2,"content":"um, just in case it's useful for some people,"},{"from":736.29,"to":739.35,"location":2,"content":"building question answering systems, right?"},{"from":739.35,"to":744.45,"location":2,"content":"So, um, so for sort of word vectors on unknown words, you know,"},{"from":744.45,"to":749.7,"location":2,"content":"the commonest thing historically is you've got your supervised training data,"},{"from":749.7,"to":752.79,"location":2,"content":"you define a vocab which might be words that occur"},{"from":752.79,"to":756.25,"location":2,"content":"five times or more in your supervised training data."},{"from":756.25,"to":759.04,"location":2,"content":"And you treat everything else as an UNK."},{"from":759.04,"to":762.09,"location":2,"content":"And so you also train one vector per UNK."},{"from":762.09,"to":766.14,"location":2,"content":"Um, but that has some problems which you have no way to"},{"from":766.14,"to":771.25,"location":2,"content":"distinguish different UNK words either for identity or meaning."},{"from":771.25,"to":774.75,"location":2,"content":"And that tends to be problematic for question answering systems."},{"from":774.75,"to":778.14,"location":2,"content":"And so one way to fix that is what we talked about last week,"},{"from":778.14,"to":780.63,"location":2,"content":"you just say, \"Oh, words are made out of characters."},{"from":780.63,"to":785.65,"location":2,"content":"I can use character representations to learn word vectors for other words.\""},{"from":785.65,"to":786.96,"location":2,"content":"And you can certainly do that."},{"from":786.96,"to":788.23,"location":2,"content":"You might wanna try that."},{"from":788.23,"to":790.21,"location":2,"content":"That adds some complexity."},{"from":790.21,"to":794.38,"location":2,"content":"Um, but especially for things like question answering systems,"},{"from":794.38,"to":796.38,"location":2,"content":"there are a couple of other things that you can do"},{"from":796.38,"to":798.72,"location":2,"content":"that work considerably better and they've been"},{"from":798.72,"to":803.37,"location":2,"content":"explored in this paper by Dhingra et al., um, from 2017."},{"from":803.37,"to":806.7,"location":2,"content":"Um, the first one is to say, well, um,"},{"from":806.7,"to":814.25,"location":2,"content":"when you at test-time encounter new words, probably your unsupervised word,"},{"from":814.25,"to":819.34,"location":2,"content":"pre-trained word embeddings have a much bigger vocabulary than your actual system does."},{"from":819.34,"to":822.09,"location":2,"content":"So anytime you come across a word that isn't in"},{"from":822.09,"to":824.96,"location":2,"content":"your vocab but is in the pre-trained word embeddings,"},{"from":824.96,"to":828.99,"location":2,"content":"just use, get the word vector of that word and start using it."},{"from":828.99,"to":831.75,"location":2,"content":"That'll be a much more useful thing to use."},{"from":831.75,"to":833.85,"location":2,"content":"And then there's a second possible tip that if you"},{"from":833.85,"to":836.3,"location":2,"content":"see something that's still an unknown word,"},{"from":836.3,"to":837.92,"location":2,"content":"rather than treating it as UNK,"},{"from":837.92,"to":840.03,"location":2,"content":"you just assign it on the spot,"},{"from":840.03,"to":841.75,"location":2,"content":"a random word vector."},{"from":841.75,"to":846.81,"location":2,"content":"And so this has the effect that each word does get a unique identity."},{"from":846.81,"to":849.36,"location":2,"content":"Which means if you see the same word in the question,"},{"from":849.36,"to":851.07,"location":2,"content":"and a potential answer,"},{"from":851.07,"to":854.94,"location":2,"content":"they will match together beautifully in an accurate way which you're"},{"from":854.94,"to":859.89,"location":2,"content":"not getting with just UNK matching and those can be kind of useful ideas to try."},{"from":859.89,"to":865.27,"location":2,"content":"Okay, end digression. Okay, so up until now,"},{"from":865.27,"to":868.23,"location":2,"content":"we just sort of had this representation of words,"},{"from":868.23,"to":871.6,"location":2,"content":"we ran Word2vec and we got a word vector,"},{"from":871.6,"to":873.73,"location":2,"content":"um, for each word."},{"from":873.73,"to":877.57,"location":2,"content":"Um, so, um, that, that was useful."},{"from":877.57,"to":879.11,"location":2,"content":"It's worked pretty well."},{"from":879.11,"to":881.54,"location":2,"content":"Um, but it had, um,"},{"from":881.54,"to":888.53,"location":2,"content":"some big problems. So what were the big problems of doing that?"},{"from":888.53,"to":890.78,"location":2,"content":"The problems when we,"},{"from":890.78,"to":893.48,"location":2,"content":"of having a word vector in each word, yes."},{"from":893.48,"to":896.83,"location":2,"content":"A lot of words have like one spelling, but a whole bunch of meanings."},{"from":896.83,"to":900.55,"location":2,"content":"Right, so, a word can have- So, typically,"},{"from":900.55,"to":905.62,"location":2,"content":"you have one string of letters which has a whole bunch of meanings."},{"from":905.62,"to":909.22,"location":2,"content":"So, words have a ton of senses."},{"from":909.22,"to":911.35,"location":2,"content":"Um, and yeah, so that's"},{"from":911.35,"to":913.4,"location":2,"content":"the biggest and most obvious problem that we're"},{"from":913.4,"to":915.55,"location":2,"content":"collapsing together all the meanings of words."},{"from":915.55,"to":918.13,"location":2,"content":"So, we talked about a bit where"},{"from":918.13,"to":920.29,"location":2,"content":"one solution to that was you could distinguish"},{"from":920.29,"to":923.68,"location":2,"content":"word senses and to have different word vectors for them."},{"from":923.68,"to":927.7,"location":2,"content":"Um, and I then said something about also you could think of"},{"from":927.7,"to":931.75,"location":2,"content":"this word vector as a sort of a mixture of them and maybe your model could separate it."},{"from":931.75,"to":935.07,"location":2,"content":"But it seems like we might want to take that more seriously."},{"from":935.07,"to":937.42,"location":2,"content":"And one way, um,"},{"from":937.42,"to":943.35,"location":2,"content":"that we could take that more seriously is we could start to say, well,"},{"from":943.35,"to":950.07,"location":2,"content":"really, you know, traditional lists of word senses are themselves a crude approximation."},{"from":950.07,"to":957.92,"location":2,"content":"What we actually want to know is the sense of the word inside a particular context of use."},{"from":957.92,"to":960.4,"location":2,"content":"And sort of what I mean by that is, you know,"},{"from":960.4,"to":964.57,"location":2,"content":"we distinguish different senses of a word, right?"},{"from":964.57,"to":968.38,"location":2,"content":"Say for the word star there's the astronomical sense and"},{"from":968.38,"to":972.41,"location":2,"content":"there's the Hollywood sense and they're clearly different."},{"from":972.41,"to":976.27,"location":2,"content":"But you know, if we then go to this what I'm calling the Hollywood sense,"},{"from":976.27,"to":978.37,"location":2,"content":"I could then say, well, wait a minute."},{"from":978.37,"to":981.52,"location":2,"content":"There are movie stars and there are rock stars,"},{"from":981.52,"to":984.07,"location":2,"content":"and there, uh, are R&B stars,"},{"from":984.07,"to":985.83,"location":2,"content":"and there are country stars."},{"from":985.83,"to":989.27,"location":2,"content":"Now, all of those different senses, um,"},{"from":989.27,"to":993.02,"location":2,"content":"in certain contexts, though, one or other of them would be evoked."},{"from":993.02,"to":994.21,"location":2,"content":"And so, you know,"},{"from":994.21,"to":996.88,"location":2,"content":"it's very hard if you're trying to actually enumerate"},{"from":996.88,"to":1000.83,"location":2,"content":"senses of a word as to which ones count as different or the same."},{"from":1000.83,"to":1004.78,"location":2,"content":"So, it's really you sort of wanna know what a word means in a context."},{"from":1004.78,"to":1010.51,"location":2,"content":"There's a second limitation of these word vectors which is,"},{"from":1010.51,"to":1013.71,"location":2,"content":"we haven't really talked about and is less obvious,"},{"from":1013.71,"to":1016.77,"location":2,"content":"but it's also something that we might want to fix, and at least one of"},{"from":1016.77,"to":1020.07,"location":2,"content":"the models we discussed today takes some aim at that,"},{"from":1020.07,"to":1024.05,"location":2,"content":"and that is, we just sort of have one vector for a word."},{"from":1024.05,"to":1027.46,"location":2,"content":"But there are sort of different dimensions of a word."},{"from":1027.46,"to":1030.39,"location":2,"content":"So, words can have different meanings,"},{"from":1030.39,"to":1034.61,"location":2,"content":"some sort of real semantics or words can have"},{"from":1034.61,"to":1039.77,"location":2,"content":"different syntactic behavior like different parts of speech or grammatical behavior."},{"from":1039.77,"to":1043.07,"location":2,"content":"So, in some sense, arrive and arrival,"},{"from":1043.07,"to":1045.67,"location":2,"content":"their semantics are almost the same,"},{"from":1045.67,"to":1048.99,"location":2,"content":"but they're different parts of speech."},{"from":1048.99,"to":1052.41,"location":2,"content":"One is a, um, a verb and one is a noun,"},{"from":1052.41,"to":1055.38,"location":2,"content":"so they can kind of appear in quite different places."},{"from":1055.38,"to":1059.13,"location":2,"content":"And you know, you'd wanna do different things with them in a dependency parser."},{"from":1059.13,"to":1061.29,"location":2,"content":"And there are even other dimensions."},{"from":1061.29,"to":1067.2,"location":2,"content":"So, words also have register and connotation differences."},{"from":1067.2,"to":1072.27,"location":2,"content":"So, you can probably think of lots of different words for a bathroom,"},{"from":1072.27,"to":1076.17,"location":2,"content":"and a lot of those words all means semantically the same,"},{"from":1076.17,"to":1078.33,"location":2,"content":"but have rather different registers and"},{"from":1078.33,"to":1081.22,"location":2,"content":"connotations as to when they're appropriate to use."},{"from":1081.22,"to":1084.9,"location":2,"content":"And so, we might want to distinguish words on that basis as well."},{"from":1084.9,"to":1088.35,"location":2,"content":"And so these are the kinds of soluti- things we want to"},{"from":1088.35,"to":1091.85,"location":2,"content":"solve with our new contextual word embeddings."},{"from":1091.85,"to":1096.24,"location":2,"content":"Um, so I've said up until now, you know,"},{"from":1096.24,"to":1101.67,"location":2,"content":"oh, we just had these word vectors that we use,"},{"from":1101.67,"to":1104.2,"location":2,"content":"words just had one vector."},{"from":1104.2,"to":1109.17,"location":2,"content":"Um, but if you actually think about it, maybe that's wrong."},{"from":1109.17,"to":1114.27,"location":2,"content":"I mean, maybe we never had a problem, or at any rate, we solved it six classes ago."},{"from":1114.27,"to":1116.2,"location":2,"content":"Because if you remember back, [NOISE] um,"},{"from":1116.2,"to":1119.46,"location":2,"content":"to when we started talking about neural language models,"},{"from":1119.46,"to":1121.95,"location":2,"content":"well, what did a neural language model do?"},{"from":1121.95,"to":1125.1,"location":2,"content":"At the bottom, you fed into it the word vectors."},{"from":1125.1,"to":1129.6,"location":2,"content":"But then you ran across that one or more recurrent layers,"},{"from":1129.6,"to":1131.57,"location":2,"content":"something like a LSTM layer,"},{"from":1131.57,"to":1137.43,"location":2,"content":"and it was calculating these representations that sit above each word and,"},{"from":1137.43,"to":1140.76,"location":2,"content":"you know, the role of those hidden states is a bit ambivalent."},{"from":1140.76,"to":1142.26,"location":2,"content":"They are used for prediction."},{"from":1142.26,"to":1146.67,"location":2,"content":"And they are used for next hidden state and output states and so on."},{"from":1146.67,"to":1149.2,"location":2,"content":"But in many ways you can think huh,"},{"from":1149.2,"to":1156.05,"location":2,"content":"these representations are actually representations of a word in context."},{"from":1156.05,"to":1158.89,"location":2,"content":"And if you think about what happened with, uh,"},{"from":1158.89,"to":1161.31,"location":2,"content":"the question answering systems,"},{"from":1161.31,"to":1163.62,"location":2,"content":"that's exactly how they were used, right?"},{"from":1163.62,"to":1166.2,"location":2,"content":"We ran LSTM's backwards and forwards,"},{"from":1166.2,"to":1169.32,"location":2,"content":"over a question in the passage, and then we say,"},{"from":1169.32,"to":1173.09,"location":2,"content":"okay those are a good representation of a word's meaning and context."},{"from":1173.09,"to":1176.2,"location":2,"content":"Let's start matching them with attention functions et cetera."},{"from":1176.2,"to":1181.47,"location":2,"content":"So, it sort of seemed like we'd already invented a way to have,"},{"from":1181.47,"to":1187.18,"location":2,"content":"um, context-specific representations of words."},{"from":1187.18,"to":1190.04,"location":2,"content":"And effectively, you know,"},{"from":1190.04,"to":1195.45,"location":2,"content":"the rest of the content of this lecture is sort of basically no more complex than that."},{"from":1195.45,"to":1202.24,"location":2,"content":"Um, that it took a while but sort of people woke up and started to notice, huh,"},{"from":1202.24,"to":1204.6,"location":2,"content":"really when you're running any language model,"},{"from":1204.6,"to":1208.37,"location":2,"content":"you generate a context-specific representation of words."},{"from":1208.37,"to":1211.36,"location":2,"content":"Maybe, if we just took those context-specific"},{"from":1211.36,"to":1216.81,"location":2,"content":"representation of words, they'd be useful for doing other things with them."},{"from":1216.81,"to":1218.6,"location":2,"content":"And that's sort of, you know,"},{"from":1218.6,"to":1219.78,"location":2,"content":"there are a few more details,"},{"from":1219.78,"to":1223.98,"location":2,"content":"but that's really the summary of the entire of this lecture."},{"from":1223.98,"to":1234.16,"location":2,"content":"Um, so one of the first things to do that was a paper that Matt Peters wrote in 2017,"},{"from":1234.16,"to":1236.27,"location":2,"content":"um, the year before last."},{"from":1236.27,"to":1241.08,"location":2,"content":"Um, and this was sort of a predecessor to the sort of modern, um,"},{"from":1241.08,"to":1246.72,"location":2,"content":"versions of, um, these context-sensitive word embeddings."},{"from":1246.72,"to":1249.84,"location":2,"content":"So, um, together with co-authors,"},{"from":1249.84,"to":1253.18,"location":2,"content":"he came up with a paper called TagLM,"},{"from":1253.18,"to":1256.92,"location":2,"content":"but it essentially already had all the main ideas."},{"from":1256.92,"to":1261.26,"location":2,"content":"So, what, um, was wanted was okay."},{"from":1261.26,"to":1265.62,"location":2,"content":"We want to do better at tasks such as named-entity recognition."},{"from":1265.62,"to":1270.94,"location":2,"content":"And what we'd like to do is know about the meaning of a word in context."},{"from":1270.94,"to":1274.5,"location":2,"content":"Um, but you know, standardly if we're doing named-entity recognition,"},{"from":1274.5,"to":1278.17,"location":2,"content":"we just train it on half a million words of supervised data."},{"from":1278.17,"to":1280.23,"location":2,"content":"And that's not much of a source of"},{"from":1280.23,"to":1283.95,"location":2,"content":"information to be learning about the meaning of words and context."},{"from":1283.95,"to":1288.81,"location":2,"content":"So, why don't we adopt the semi-supervised approach and so that's what we do."},{"from":1288.81,"to":1292.74,"location":2,"content":"So, we start off with a ton of unlabeled data."},{"from":1292.74,"to":1295.55,"location":2,"content":"Um, and from that unlabeled data,"},{"from":1295.55,"to":1299.85,"location":2,"content":"we can train a conventional word embedding model like Word2vec."},{"from":1299.85,"to":1303.81,"location":2,"content":"But we can also at the same time train a neural language model."},{"from":1303.81,"to":1307.59,"location":2,"content":"So, something like a bi-LSTM language model."},{"from":1307.59,"to":1315.74,"location":2,"content":"Okay. So, then for step two when we're using our supervised data,"},{"from":1315.74,"to":1318.9,"location":2,"content":"um, actually, I guess that's step three."},{"from":1318.9,"to":1325.96,"location":2,"content":"Okay. Um, so for then when we want to learn our supervised part-of-speech tagger at the top,"},{"from":1325.96,"to":1329.19,"location":2,"content":"what we're gonna do is say, well,"},{"from":1329.19,"to":1333.42,"location":2,"content":"for the input words New what York is located,"},{"from":1333.42,"to":1338.34,"location":2,"content":"we can not only use the word embedding which is context independent,"},{"from":1338.34,"to":1344.51,"location":2,"content":"but we can use our trained recurrent language model and also run it over this import,"},{"from":1344.51,"to":1351.18,"location":2,"content":"and then we'll generate hidden states in our bi-LSTM language model and we can also"},{"from":1351.18,"to":1358.38,"location":2,"content":"feed those in as features into ou- our sequence tagging model,"},{"from":1358.38,"to":1361.34,"location":2,"content":"and those features will let it work better."},{"from":1361.34,"to":1367.1,"location":2,"content":"Here's a second picture that runs this through in much greater detail."},{"from":1367.1,"to":1372.88,"location":2,"content":"So, so, we're assuming that we have trained, uh,"},{"from":1372.88,"to":1376.88,"location":2,"content":"bi-LSTM language model, um,"},{"from":1376.88,"to":1379.76,"location":2,"content":"on a lot of unsupervised data."},{"from":1379.76,"to":1386.37,"location":2,"content":"Then what we wanna do is we want to do named entity recognition for New York is located."},{"from":1386.37,"to":1389.16,"location":2,"content":"So, the first thing we do is say,"},{"from":1389.16,"to":1396.15,"location":2,"content":"let's just run New York is located through our separately trained neural language model."},{"from":1396.15,"to":1398.92,"location":2,"content":"So, we run it through a forward language model."},{"from":1398.92,"to":1401.49,"location":2,"content":"We run it through a backward language model."},{"from":1401.49,"to":1403.83,"location":2,"content":"We get from that, um,"},{"from":1403.83,"to":1406.52,"location":2,"content":"a hidden state representation,"},{"from":1406.52,"to":1408.75,"location":2,"content":"um, for each word,"},{"from":1408.75,"to":1411.64,"location":2,"content":"we concatenate the forward and backward ones,"},{"from":1411.64,"to":1415.53,"location":2,"content":"and that's going to give a set, a concatenated language model embedding"},{"from":1415.53,"to":1420.09,"location":2,"content":"which we'll use as features in our named entity recognizer."},{"from":1420.09,"to":1423.87,"location":2,"content":"So, then for the named entity recognizer itself that we're gonna"},{"from":1423.87,"to":1428.7,"location":2,"content":"train supervised while we have the same sentence,"},{"from":1428.7,"to":1435.39,"location":2,"content":"so we can both look up a Word2vec-style token embedding for it."},{"from":1435.39,"to":1441.32,"location":2,"content":"We can use what we learned about with character level CNNs and RNNs and we can build"},{"from":1441.32,"to":1444.45,"location":2,"content":"a character level representation for it which we also"},{"from":1444.45,"to":1447.8,"location":2,"content":"concatenate to have two representations."},{"from":1447.8,"to":1455.68,"location":2,"content":"So, we feed these representations into a bi-LSTM layer."},{"from":1455.68,"to":1459.94,"location":2,"content":"But then when we get the output of the, this bi-LSTM layer,"},{"from":1459.94,"to":1462.18,"location":2,"content":"as well as this normal output,"},{"from":1462.18,"to":1468.29,"location":2,"content":"we can concatenate with each output what was- what we get from our,"},{"from":1468.29,"to":1470.73,"location":2,"content":"um, neural language model."},{"from":1470.73,"to":1473.37,"location":2,"content":"So, each of these things becomes a pair of states."},{"from":1473.37,"to":1476.49,"location":2,"content":"One that's spit up from the first bi-LSTM layer and"},{"from":1476.49,"to":1479.76,"location":2,"content":"then it's concatenated with something from the neural language model."},{"from":1479.76,"to":1486.45,"location":2,"content":"And so that concatenated representation is then fed into a second layer of bi-LSTM."},{"from":1486.45,"to":1488.27,"location":2,"content":"And then from the output of that,"},{"from":1488.27,"to":1491.31,"location":2,"content":"we do the usual kind of softmax classification"},{"from":1491.31,"to":1494.79,"location":2,"content":"where we're then giving tags like beginning of location,"},{"from":1494.79,"to":1499.38,"location":2,"content":"end of location, say New York is a location and then is, we'll get"},{"from":1499.38,"to":1507.86,"location":2,"content":"another tag to say it's not a location. Does that makes sense?"},{"from":1507.86,"to":1514.31,"location":2,"content":"Yeah so, um, so the central thing is"},{"from":1514.31,"to":1520.45,"location":2,"content":"sort of having seen that these sort of representations that we get from Bi-LSTMs are useful."},{"from":1520.45,"to":1524.58,"location":2,"content":"We're just going to feed them into supervised models as we train them,"},{"from":1524.58,"to":1528.6,"location":2,"content":"and the idea is that will give us better features of words."},{"from":1528.6,"to":1532.31,"location":2,"content":"Some kind of representation of their meaning and context,"},{"from":1532.31,"to":1539.61,"location":2,"content":"which will allow us to learn better named entity recognizers or what it- whatever it is."},{"from":1539.61,"to":1542.58,"location":2,"content":"Maybe I should put this slide earlier,"},{"from":1542.58,"to":1545.95,"location":2,"content":"but this slide was meant to remind you what a named entity recognizer is."},{"from":1545.95,"to":1547.31,"location":2,"content":"I hope you remember that,"},{"from":1547.31,"to":1550.61,"location":2,"content":"something where are we going to find and label"},{"from":1550.61,"to":1554.85,"location":2,"content":"entities for things like person, location, date, organization."},{"from":1554.85,"to":1557.63,"location":2,"content":"So anyway, doing this worked."},{"from":1557.63,"to":1559.9,"location":2,"content":"So, here's a little bit of a history."},{"from":1559.9,"to":1567.29,"location":2,"content":"So the most famous Named Entity Recognition dataset is this CoNLL 2003 dataset,"},{"from":1567.29,"to":1570.18,"location":2,"content":"which actually exists in multiple languages."},{"from":1570.18,"to":1574.77,"location":2,"content":"But whenever people say CoNLL 2003 and don't mention a language,"},{"from":1574.77,"to":1577.66,"location":2,"content":"they mean the English version of it."},{"from":1577.66,"to":1580.41,"location":2,"content":"That's the way the world works."},{"from":1580.41,"to":1584.43,"location":2,"content":"Um, okay so on this dataset- yeah."},{"from":1584.43,"to":1589.04,"location":2,"content":"So, it's sort of been around for whatever, 15 years roughly now."},{"from":1589.04,"to":1592.92,"location":2,"content":"So, in the- so it was originally a competition, right?"},{"from":1592.92,"to":1596.23,"location":2,"content":"So, this is in 2003 was the original bake-off."},{"from":1596.23,"to":1598.45,"location":2,"content":"My group actually took place in that."},{"from":1598.45,"to":1602.06,"location":2,"content":"Took part in it. I think we got third or fourth place or something,"},{"from":1602.06,"to":1606.72,"location":2,"content":"and our F1 score was 86."},{"from":1606.72,"to":1612.81,"location":2,"content":"The people who won were from IBM Research Labs,"},{"from":1612.81,"to":1615.87,"location":2,"content":"and they got 88 almost 89."},{"from":1615.87,"to":1620.49,"location":2,"content":"But a difference between these two things is our system was"},{"from":1620.49,"to":1625.29,"location":2,"content":"a single clean machine-learning model categorical,"},{"from":1625.29,"to":1628.8,"location":2,"content":"whereas the IBM one was not only an ensemble"},{"from":1628.8,"to":1633.6,"location":2,"content":"of four different machine learning models, plus gazetteers."},{"from":1633.6,"to":1636.09,"location":2,"content":"It also fit in the output of"},{"from":1636.09,"to":1642.45,"location":2,"content":"two other old NER systems that IBM people were trained years ago on different data."},{"from":1642.45,"to":1645.03,"location":2,"content":"So it was- I guess it worked for them but,"},{"from":1645.03,"to":1647.1,"location":2,"content":"it was a fairly complex system."},{"from":1647.1,"to":1649.17,"location":2,"content":"Here's another system from Stanford."},{"from":1649.17,"to":1653.91,"location":2,"content":"So this was our classic Stanford NER system that is widely used."},{"from":1653.91,"to":1659.47,"location":2,"content":"So, this was then using a conditional random field model which generally dominated"},{"from":1659.47,"to":1666.93,"location":2,"content":"sort of the second half of the 2000s and the first half of the 2010s for doing NER,"},{"from":1666.93,"to":1672.8,"location":2,"content":"and it was sort of, you know, a bit but not usually better than the 2003 system."},{"from":1672.8,"to":1679.91,"location":2,"content":"This system here was sort of the best ever built categorical CRF system."},{"from":1679.91,"to":1686.11,"location":2,"content":"But rather than only using the training data to build the model as this system did,"},{"from":1686.11,"to":1691.07,"location":2,"content":"it threw in Wikipedia and other stuff to make it work better,"},{"from":1691.07,"to":1694.72,"location":2,"content":"and that got you to about 90.8 F1."},{"from":1694.72,"to":1703.77,"location":2,"content":"So, essentially, once sort of BiLSTM style models started to be known and used in NLP."},{"from":1703.77,"to":1708.06,"location":2,"content":"That was when people were able to train, build training"},{"from":1708.06,"to":1713.17,"location":2,"content":"just on the training data systems that worked a lot better."},{"from":1713.17,"to":1718.44,"location":2,"content":"Because essentially you're going from the same data from this system to that system."},{"from":1718.44,"to":1721.53,"location":2,"content":"So, you're getting about 4 percent gain on it,"},{"from":1721.53,"to":1725.84,"location":2,"content":"because it's not- wasn't making use of Wikipedia and things like that;"},{"from":1725.84,"to":1731.81,"location":2,"content":"and so this Ma and Hovy system is pretty well-known getting about 91.21."},{"from":1731.81,"to":1736.14,"location":2,"content":"Okay, but if we then go to this TagLM system, um,"},{"from":1736.14,"to":1740.61,"location":2,"content":"that Matt Peters and Co have a system that"},{"from":1740.61,"to":1745.59,"location":2,"content":"was sort of similar to the Ma and Hovy system that is a little bit worse."},{"from":1745.59,"to":1752.67,"location":2,"content":"But the point is that this BiLSTM uses sorry- using the neural language model,"},{"from":1752.67,"to":1757.08,"location":2,"content":"is just a useful oomph giver which sort of takes the results up."},{"from":1757.08,"to":1758.61,"location":2,"content":"Yeah, not night and day but,"},{"from":1758.61,"to":1764.16,"location":2,"content":"slightly over a percent and then gives them the best NER system that was then available."},{"from":1764.16,"to":1765.99,"location":2,"content":"So that sort of proved these sort of"},{"from":1765.99,"to":1773.66,"location":2,"content":"contextual word representations really had some power and started to be useful,"},{"from":1773.66,"to":1778.62,"location":2,"content":"and then there's a white space at the top because we'll get back to more of this later."},{"from":1778.62,"to":1783.24,"location":2,"content":"Um, there's some details on their language model."},{"from":1783.24,"to":1786.33,"location":2,"content":"Some of their details are that it's useful to have"},{"from":1786.33,"to":1789.29,"location":2,"content":"a bidirectional language model, not unidirectional."},{"from":1789.29,"to":1791.64,"location":2,"content":"It's useful to have a big um,"},{"from":1791.64,"to":1795.51,"location":2,"content":"language model to get much in the way of gains,"},{"from":1795.51,"to":1801.96,"location":2,"content":"um and, you need to train this language model over much more data."},{"from":1801.96,"to":1808.16,"location":2,"content":"It doesn't work if you're just sort of training it over your supervised training data."},{"from":1808.16,"to":1811.14,"location":2,"content":"Another model that was around was CoVe,"},{"from":1811.14,"to":1812.61,"location":2,"content":"but I think I'll skip that."},{"from":1812.61,"to":1815.89,"location":2,"content":"Okay. So, then the next year, um,"},{"from":1815.89,"to":1818.87,"location":2,"content":"Matt Peters and a different set of colleagues"},{"from":1818.87,"to":1823.41,"location":2,"content":"then came up with an improved system called ELMo,"},{"from":1823.41,"to":1827.61,"location":2,"content":"and effectively this was the breakthrough system."},{"from":1827.61,"to":1830.96,"location":2,"content":"That this was sort of just the system that everybody"},{"from":1830.96,"to":1835.88,"location":2,"content":"noticed and said \"Wow these contextual word vectors are great."},{"from":1835.88,"to":1837.68,"location":2,"content":"Everyone should be using them,"},{"from":1837.68,"to":1841.79,"location":2,"content":"not traditional word vectors.\" Yes?"},{"from":1841.79,"to":1859.33,"location":2,"content":"I have a simple question, imagine re-training a system, what exactly"},{"from":1859.33,"to":1862.91,"location":2,"content":"what measure [inaudible]"},{"from":1862.91,"to":1866.25,"location":2,"content":"It's pre-trained because this piece over here;"},{"from":1866.25,"to":1871.04,"location":2,"content":"a big neural language model is trained first,"},{"from":1871.04,"to":1873.27,"location":2,"content":"and there's an important thing I forgot to say."},{"from":1873.27,"to":1875.28,"location":2,"content":"So, thank you for the question."},{"from":1875.28,"to":1880.02,"location":2,"content":"The main reason why it's- in some sense pre-trained,"},{"from":1880.02,"to":1881.67,"location":2,"content":"is this was trained first."},{"from":1881.67,"to":1886.24,"location":2,"content":"But the main reason why people think of this as pre-training"},{"from":1886.24,"to":1890.98,"location":2,"content":"is after you've trained this, it is frozen."},{"from":1890.98,"to":1895.68,"location":2,"content":"So, this is just something that you can run with parameters which will give"},{"from":1895.68,"to":1900.84,"location":2,"content":"you a vector which is your contextual word representation each position,"},{"from":1900.84,"to":1903.96,"location":2,"content":"and then that's just going to be used in this system."},{"from":1903.96,"to":1906.42,"location":2,"content":"So, when you're training this system,"},{"from":1906.42,"to":1908.58,"location":2,"content":"there's no gradient flowing back into"},{"from":1908.58,"to":1912.88,"location":2,"content":"this neural language model that's changing and updating it; it's just fixed."},{"from":1912.88,"to":1916.26,"location":2,"content":"And so that's sort of the sense when people are talking about pre-training."},{"from":1916.26,"to":1919.18,"location":2,"content":"It's sort of normally a model that you trained"},{"from":1919.18,"to":1922.68,"location":2,"content":"somewhere else and that you're using to give features,"},{"from":1922.68,"to":1926.28,"location":2,"content":"but isn't part of the model that you are now training. Yeah?"},{"from":1926.28,"to":1932.06,"location":2,"content":"[inaudible]"},{"from":1932.06,"to":1936.65,"location":2,"content":"Well, I guess that's, I wouldn't quite call it reconstruction."},{"from":1936.65,"to":1940.19,"location":2,"content":"Yeah, it's unsupervised in the sense that this is a language model,"},{"from":1940.19,"to":1942.47,"location":2,"content":"you're training it to predict the next word."},{"from":1942.47,"to":1948.34,"location":2,"content":"So here are words one to k. What is the k plus oneth word during a cross entropy loss,"},{"from":1948.34,"to":1950.15,"location":2,"content":"and repeat over for each position."},{"from":1950.15,"to":1957.53,"location":2,"content":"[NOISE] Yes, so I mean,"},{"from":1957.53,"to":1965.24,"location":2,"content":"having gone through TagLM in some detail, I mean,"},{"from":1965.24,"to":1972.35,"location":2,"content":"in some sense, the difference between TagLM and ELMo is kind of small,"},{"from":1972.35,"to":1974.09,"location":2,"content":"it's sort of in the details."},{"from":1974.09,"to":1976.38,"location":2,"content":"So I mean, to a first approximation,"},{"from":1976.38,"to":1978.89,"location":2,"content":"they're doing exactly the same again,"},{"from":1978.89,"to":1980.67,"location":2,"content":"but a little bit better."},{"from":1980.67,"to":1986.36,"location":2,"content":"Um, so, um, I sort of hope it made sense the last time,"},{"from":1986.36,"to":1989.02,"location":2,"content":"I mean, what are the things that are different?"},{"from":1989.02,"to":1993.71,"location":2,"content":"Um, they do the bidirectional language model a bit differently,"},{"from":1993.71,"to":1996.8,"location":2,"content":"and actually one of their concerns was to try and come up with"},{"from":1996.8,"to":2001.43,"location":2,"content":"a compact language model that would be easy for people to use,"},{"from":2001.43,"to":2007.39,"location":2,"content":"um, in other tasks even if they don't have the beefiest computer hardware in the world."},{"from":2007.39,"to":2009.94,"location":2,"content":"And so they decided to dispense with having"},{"from":2009.94,"to":2014.18,"location":2,"content":"word representations altogether and just use, um,"},{"from":2014.18,"to":2018.61,"location":2,"content":"character CNNs to build word representations,"},{"from":2018.61,"to":2022.05,"location":2,"content":"because that lessens the number of parameters you have to store,"},{"from":2022.05,"to":2025.51,"location":2,"content":"the big matrices you have to, um, use."},{"from":2025.51,"to":2030.28,"location":2,"content":"Um, they expanded the hidden dimension to 4,096,"},{"from":2030.28,"to":2032.02,"location":2,"content":"but then they project it down to"},{"from":2032.02,"to":2037.45,"location":2,"content":"512 dimensions with a sort of feed-forward projection layer,"},{"from":2037.45,"to":2040.3,"location":2,"content":"and that's a fairly common technique to again reduce"},{"from":2040.3,"to":2043.36,"location":2,"content":"the parameterization of the model so that you have a lot of"},{"from":2043.36,"to":2046.06,"location":2,"content":"parameters going in their current direction but you"},{"from":2046.06,"to":2049.32,"location":2,"content":"need much smaller matrices for including,"},{"from":2049.32,"to":2051.4,"location":2,"content":"um, the input at the next level."},{"from":2051.4,"to":2053.53,"location":2,"content":"Um, between the layers,"},{"from":2053.53,"to":2058.3,"location":2,"content":"they now use a residual connection and they do a bit of parameter tying."},{"from":2058.3,"to":2061.61,"location":2,"content":"So it's sort of all in the little details there."},{"from":2061.61,"to":2065.2,"location":2,"content":"Um, but there's another interesting thing"},{"from":2065.2,"to":2068.89,"location":2,"content":"that they did which was an important innovation of ELMo,"},{"from":2068.89,"to":2070.41,"location":2,"content":"so we should get this bit."},{"from":2070.41,"to":2072.4,"location":2,"content":"So in TagLM,"},{"from":2072.4,"to":2076.93,"location":2,"content":"what was fed from the pre-trained LM into"},{"from":2076.93,"to":2083.7,"location":2,"content":"the main model was just the top level of the neural language model stack,"},{"from":2083.7,"to":2087.04,"location":2,"content":"and that was completely standard de rigueur in those days,"},{"from":2087.04,"to":2089.8,"location":2,"content":"that you might have had three layers of"},{"from":2089.8,"to":2093.79,"location":2,"content":"neural language model that you regard at the top-level as your sort"},{"from":2093.79,"to":2097.12,"location":2,"content":"of one that's really captured the meaning of"},{"from":2097.12,"to":2101.18,"location":2,"content":"the sentence and the lower layers for processing that led up to it."},{"from":2101.18,"to":2105.3,"location":2,"content":"Um, and they had the idea that maybe"},{"from":2105.3,"to":2109.78,"location":2,"content":"it would be useful to actually use all layers of the,"},{"from":2109.78,"to":2112.96,"location":2,"content":"biLSTM of the neural language models."},{"from":2112.96,"to":2116.93,"location":2,"content":"So maybe not just the top layer but all layers would be kind of useful."},{"from":2116.93,"to":2120.76,"location":2,"content":"So, um, there are these kind of complex equations,"},{"from":2120.76,"to":2124.48,"location":2,"content":"uh, but essentially the point of it over here is,"},{"from":2124.48,"to":2127.36,"location":2,"content":"we going- for a particular position,"},{"from":2127.36,"to":2129.51,"location":2,"content":"word seven in the language model,"},{"from":2129.51,"to":2133.93,"location":2,"content":"we're going to take the hidden state at each level of our,"},{"from":2133.93,"to":2136.6,"location":2,"content":"our neural language model stack,"},{"from":2136.6,"to":2140.55,"location":2,"content":"we're going to give- learn a weight for that level,"},{"from":2140.55,"to":2142.54,"location":2,"content":"we go in to sort of sum them,"},{"from":2142.54,"to":2147.19,"location":2,"content":"so this is sort of a weighted average of the hidden layers at each position,"},{"from":2147.19,"to":2151.22,"location":2,"content":"and that will be used as our basic representation."},{"from":2151.22,"to":2155.78,"location":2,"content":"Um, and so, they found that that gave quite a bit"},{"from":2155.78,"to":2160.48,"location":2,"content":"of extra usefulness for- and different tasks could prefer different layers."},{"from":2160.48,"to":2163.05,"location":2,"content":"There's one other bit here which is,"},{"from":2163.05,"to":2168.63,"location":2,"content":"they learn a global scaling factor Gamma for a particular task."},{"from":2168.63,"to":2173.66,"location":2,"content":"And this allows them to control that for some tasks, the, um,"},{"from":2173.66,"to":2176.08,"location":2,"content":"contextual word embeddings might be really"},{"from":2176.08,"to":2179.51,"location":2,"content":"useful and for other tasks they might not be so useful,"},{"from":2179.51,"to":2181.45,"location":2,"content":"so you're just sort of learning a specific,"},{"from":2181.45,"to":2185.09,"location":2,"content":"um, usefulness for the entire task."},{"from":2185.09,"to":2190.28,"location":2,"content":"Okay. So, um, that's the sort of new version of language model."},{"from":2190.28,"to":2193.39,"location":2,"content":"But this, this is allowing this idea of well,"},{"from":2193.39,"to":2196.75,"location":2,"content":"maybe there's sort of more syntactic meanings"},{"from":2196.75,"to":2199.86,"location":2,"content":"of a word and more semantic meanings of a word,"},{"from":2199.86,"to":2203.38,"location":2,"content":"possibly those could be represented at different layers of"},{"from":2203.38,"to":2205.51,"location":2,"content":"your neural language model and then for"},{"from":2205.51,"to":2208.33,"location":2,"content":"different tasks you can differentially weight them."},{"from":2208.33,"to":2211.33,"location":2,"content":"Um, so that's the basic model."},{"from":2211.33,"to":2216.85,"location":2,"content":"So you run your biLSTM before to g et representations of each word."},{"from":2216.85,"to":2219.61,"location":2,"content":"And then the generic ELMo recipe was,"},{"from":2219.61,"to":2223.22,"location":2,"content":"well, with that frozen language model,"},{"from":2223.22,"to":2228.54,"location":2,"content":"you want to feed it into some supervised model depending on what the task was,"},{"from":2228.54,"to":2230.07,"location":2,"content":"and they sort of say in the paper, well,"},{"from":2230.07,"to":2232.5,"location":2,"content":"how you do this maybe depends on the task."},{"from":2232.5,"to":2235.97,"location":2,"content":"You might want to kind of concatenate it to the intermediate layer,"},{"from":2235.97,"to":2237.66,"location":2,"content":"just as the TagLM did,"},{"from":2237.66,"to":2239.09,"location":2,"content":"that might be fine."},{"from":2239.09,"to":2242.22,"location":2,"content":"But you know it might also be useful to make use of"},{"from":2242.22,"to":2245.7,"location":2,"content":"these ELMo representations when producing outputs,"},{"from":2245.7,"to":2248.91,"location":2,"content":"so if you're doing something like a"},{"from":2248.91,"to":2255.21,"location":2,"content":"generation system or you might just sort of feed in the ELMo representation again,"},{"from":2255.21,"to":2258.63,"location":2,"content":"be- before you sort of do the softmax to find the output,"},{"from":2258.63,"to":2261.58,"location":2,"content":"they sort of left it flexible as to how it was used,"},{"from":2261.58,"to":2262.96,"location":2,"content":"but the general picture,"},{"from":2262.96,"to":2265.96,"location":2,"content":"you know, was kinda like we saw before."},{"from":2265.96,"to":2269.59,"location":2,"content":"Indeed I'm reusing the same picture that you've calculated"},{"from":2269.59,"to":2274.11,"location":2,"content":"an ELMo representation for each position as a weighted average,"},{"from":2274.11,"to":2277.36,"location":2,"content":"and then you're sort of concatenating that to the hidden state of"},{"from":2277.36,"to":2281.13,"location":2,"content":"your supervised system and generating your output."},{"from":2281.13,"to":2284.89,"location":2,"content":"And anyway, um, one way or another,"},{"from":2284.89,"to":2287.92,"location":2,"content":"um, they were able to do this, uh,"},{"from":2287.92,"to":2291.93,"location":2,"content":"and that with the little improvements that gave them about an extra"},{"from":2291.93,"to":2296.77,"location":2,"content":"0.3 percent in Named Entity Recognition."},{"from":2296.77,"to":2301.16,"location":2,"content":"Um, now, that sort of sounds like not very much."},{"from":2301.16,"to":2306.05,"location":2,"content":"And you might conclude from this why the excitement [LAUGHTER] and,"},{"from":2306.05,"to":2308.7,"location":2,"content":"you know, in some sense, um,"},{"from":2308.7,"to":2313.72,"location":2,"content":"that's right because sort of to the extent that there was an interesting idea here really"},{"from":2313.72,"to":2319.06,"location":2,"content":"that come up with it for the TagLM paper which gave a much better gain."},{"from":2319.06,"to":2325.25,"location":2,"content":"But, you know, why everyone got really excited was that in the ELMo paper,"},{"from":2325.25,"to":2328.03,"location":2,"content":"they then showed this isn't something that you can"},{"from":2328.03,"to":2330.91,"location":2,"content":"do one-off to improve a Named Entity Recognizer,"},{"from":2330.91,"to":2338.03,"location":2,"content":"you can take these ELMo representations and use them for pretty much any NLP task,"},{"from":2338.03,"to":2341.7,"location":2,"content":"and they can be very useful and give good gains."},{"from":2341.7,"to":2348.34,"location":2,"content":"And so, essentially why people got excited was because of the data that's in this table."},{"from":2348.34,"to":2351.25,"location":2,"content":"So here we're taking a whole bunch of very different tasks,"},{"from":2351.25,"to":2353.62,"location":2,"content":"so there's SQuAD question-answering, uh,"},{"from":2353.62,"to":2356.38,"location":2,"content":"there's natural language inference,"},{"from":2356.38,"to":2358.34,"location":2,"content":"there's semantic role labeling,"},{"from":2358.34,"to":2363.76,"location":2,"content":"there's co-reference, the Named Entity Recognition, doing sentiment analysis,"},{"from":2363.76,"to":2366.73,"location":2,"content":"so a wide range of different NLP tasks,"},{"from":2366.73,"to":2370.32,"location":2,"content":"and they have a previous state of the art system."},{"from":2370.32,"to":2374.86,"location":2,"content":"They produced their own baseline um, which is,"},{"from":2374.86,"to":2380.08,"location":2,"content":"you know, commonly sort of similar to the previous state of the art,"},{"from":2380.08,"to":2383.62,"location":2,"content":"but usually actually a bit worse than"},{"from":2383.62,"to":2385.36,"location":2,"content":"the current state of the art because it's"},{"from":2385.36,"to":2388.32,"location":2,"content":"whatever simpler cleaner system that they came up with,"},{"from":2388.32,"to":2391.34,"location":2,"content":"but then they could say in each case,"},{"from":2391.34,"to":2395.26,"location":2,"content":"oh, just take this system and add"},{"from":2395.26,"to":2399.99,"location":2,"content":"ELMo vectors into the hidden representations in the middle,"},{"from":2399.99,"to":2402.04,"location":2,"content":"and have those help you predict."},{"from":2402.04,"to":2404.71,"location":2,"content":"And in general, in all cases,"},{"from":2404.71,"to":2408.97,"location":2,"content":"that's giving you about a three percent or so gain absolute"},{"from":2408.97,"to":2413.47,"location":2,"content":"which was then producing this huge performance increase,"},{"from":2413.47,"to":2418.45,"location":2,"content":"which in all cases was moving the performance well above the previous,"},{"from":2418.45,"to":2420.04,"location":2,"content":"um, state of the art system."},{"from":2420.04,"to":2424,"location":2,"content":"So you know, this sort of then made it seem like magic pixie dust,"},{"from":2424,"to":2428.05,"location":2,"content":"because, you know, in the stakes of NLP conference land, you know,"},{"from":2428.05,"to":2430.96,"location":2,"content":"a lot of people use to try and to come up"},{"from":2430.96,"to":2434.5,"location":2,"content":"with a paper for the next year that's one percent better"},{"from":2434.5,"to":2437.08,"location":2,"content":"on one task and writing it up and that's"},{"from":2437.08,"to":2441.72,"location":2,"content":"their big breakthrough for the year to get their new paper out."},{"from":2441.72,"to":2444.36,"location":2,"content":"And the idea that there's just well this set of"},{"from":2444.36,"to":2448.05,"location":2,"content":"this way of creating context sensitive, um,"},{"from":2448.05,"to":2451.66,"location":2,"content":"word representations and you just use them in any task,"},{"from":2451.66,"to":2455.24,"location":2,"content":"and they'll give you around three percent and take you past the state of the art,"},{"from":2455.24,"to":2458.39,"location":2,"content":"this seemed like it was really great stuff."},{"from":2458.39,"to":2461.8,"location":2,"content":"And so people got very excited about this and that won"},{"from":2461.8,"to":2466.39,"location":2,"content":"the Best Paper Award at the NAACL 2018 conference."},{"from":2466.39,"to":2470.59,"location":2,"content":"Ah, and then, a- as I sort of vaguely mentioned,"},{"from":2470.59,"to":2474.37,"location":2,"content":"um, so the model that they actually used wasn't a deep stack,"},{"from":2474.37,"to":2477.52,"location":2,"content":"there were actually only two layers of biLSTMs,"},{"from":2477.52,"to":2482.62,"location":2,"content":"but they do show this interesting result that the lower level better captures"},{"from":2482.62,"to":2486.79,"location":2,"content":"low-level syntax word properties"},{"from":2486.79,"to":2490.39,"location":2,"content":"and its most useful things like part-of-speech tagging, syntactic"},{"from":2490.39,"to":2493.21,"location":2,"content":"dependencies, NER, where the top layer of"},{"from":2493.21,"to":2495.31,"location":2,"content":"their language model is better for"},{"from":2495.31,"to":2498.94,"location":2,"content":"higher level semantics that is more useful for things like sentiments,"},{"from":2498.94,"to":2502.49,"location":2,"content":"semantic role labeling and question answering."},{"from":2502.49,"to":2505.15,"location":2,"content":"Um, so that seemed interesting,"},{"from":2505.15,"to":2507.94,"location":2,"content":"though it'll actually be interesting to see how that panned"},{"from":2507.94,"to":2512.1,"location":2,"content":"out more if you had sort of more layers to play with."},{"from":2512.1,"to":2515.88,"location":2,"content":"Okay. ELMo, done."},{"from":2515.88,"to":2518.59,"location":2,"content":"Um, so I'm moving right ahead."},{"from":2518.59,"to":2525.55,"location":2,"content":"Um, here's something else that I just thought I should mention a little bit about,"},{"from":2525.55,"to":2529.27,"location":2,"content":"another piece of work that came out around the same time,"},{"from":2529.27,"to":2532.45,"location":2,"content":"a few months later maybe or maybe not,"},{"from":2532.45,"to":2534.43,"location":2,"content":"came out around the same time, uh,"},{"from":2534.43,"to":2538.42,"location":2,"content":"in, in 2018, was this work on"},{"from":2538.42,"to":2543.03,"location":2,"content":"Universal Language Model Fine-tuning for text classification,"},{"from":2543.03,"to":2545.99,"location":2,"content":"um, or ULMfit, by Howard and Ruder."},{"from":2545.99,"to":2551.34,"location":2,"content":"And essentially this had the same general idea of saying, Well,"},{"from":2551.34,"to":2560.56,"location":2,"content":"what we want to do is transfer learning where we could learn a big language model, um."},{"from":2560.56,"to":2563.07,"location":2,"content":"A big language model,"},{"from":2563.07,"to":2568.22,"location":2,"content":"and then for our target task which might be named entity recognition."},{"from":2568.22,"to":2570.2,"location":2,"content":"But here's text classification,"},{"from":2570.2,"to":2575.69,"location":2,"content":"we can transfer this language model information and help us to do better with the task."},{"from":2575.69,"to":2578.69,"location":2,"content":"And so, they proposed an architecture to do that."},{"from":2578.69,"to":2580.64,"location":2,"content":"And so, their architecture was,"},{"from":2580.64,"to":2587.96,"location":2,"content":"you have a big unsupervised corpus from which you train a neural language model."},{"from":2587.96,"to":2592.78,"location":2,"content":"They used the deeper neural language model with three hidden layers."},{"from":2592.78,"to":2594.92,"location":2,"content":"Um, you then fine tune"},{"from":2594.92,"to":2599.66,"location":2,"content":"your neural language model on the actual domain that you're interested in working in."},{"from":2599.66,"to":2602.26,"location":2,"content":"So, this was sort of an extra stage that they did."},{"from":2602.26,"to":2604.73,"location":2,"content":"And then finally, um,"},{"from":2604.73,"to":2608.96,"location":2,"content":"you now introduce your classification objectives."},{"from":2608.96,"to":2611.93,"location":2,"content":"So, what they're going to be doing is making text classifiers."},{"from":2611.93,"to":2613.53,"location":2,"content":"So, we're now wanting to,"},{"from":2613.53,"to":2619.28,"location":2,"content":"take this model and turn it from a language model into a text classifier."},{"from":2619.28,"to":2622.34,"location":2,"content":"Um, but there's something that they did differently, um,"},{"from":2622.34,"to":2623.72,"location":2,"content":"which is in some sense,"},{"from":2623.72,"to":2626.84,"location":2,"content":"foreshadows the later work in transformers."},{"from":2626.84,"to":2632.21,"location":2,"content":"So, rather than just feeding features from this into a completely different network,"},{"from":2632.21,"to":2638.71,"location":2,"content":"they keep using the same network but they introduce a different objective at the top."},{"from":2638.71,"to":2641.71,"location":2,"content":"So, one thing you could do with this network is use"},{"from":2641.71,"to":2645.01,"location":2,"content":"it to predict the next word as a language model."},{"from":2645.01,"to":2646.46,"location":2,"content":"And so at this point,"},{"from":2646.46,"to":2649.82,"location":2,"content":"they freeze the parameters of that softmax at the top,"},{"from":2649.82,"to":2651.45,"location":2,"content":"that's why it's shown in black."},{"from":2651.45,"to":2654.93,"location":2,"content":"Um, but instead, they could stick on"},{"from":2654.93,"to":2659.82,"location":2,"content":"a different prediction unit where it's predicting stuff for a particular task."},{"from":2659.82,"to":2661.61,"location":2,"content":"So, it might be predicting"},{"from":2661.61,"to":2666.68,"location":2,"content":"positive or negative sentiment in a text classification task or something like that."},{"from":2666.68,"to":2667.76,"location":2,"content":"So, in their model,"},{"from":2667.76,"to":2671.91,"location":2,"content":"they're sort of reusing the same network but sticking on the top of that,"},{"from":2671.91,"to":2676.2,"location":2,"content":"a different layer, to do the new classification task."},{"from":2676.2,"to":2679.7,"location":2,"content":"Um, they were also interested in something small,"},{"from":2679.7,"to":2683.61,"location":2,"content":"the sort of one GPU model of research, um,"},{"from":2683.61,"to":2687.62,"location":2,"content":"the paper has a lot of detail, the sort of tricks"},{"from":2687.62,"to":2692.15,"location":2,"content":"and care and feeding of your neural models to maximize performance."},{"from":2692.15,"to":2696.24,"location":2,"content":"If you're interested in that, you could sort of look up some of the details about that."},{"from":2696.24,"to":2700.25,"location":2,"content":"Um, but what they were able to show again,"},{"from":2700.25,"to":2703.82,"location":2,"content":"was making use of this language model pre-training was"},{"from":2703.82,"to":2707.49,"location":2,"content":"a very effective way to improve performance,"},{"from":2707.49,"to":2709.86,"location":2,"content":"this time for text classification."},{"from":2709.86,"to":2712.52,"location":2,"content":"So, these are text classification datasets,"},{"from":2712.52,"to":2714.26,"location":2,"content":"IMDb is for sentiment,"},{"from":2714.26,"to":2718.97,"location":2,"content":"um, TREC is for topical text classification, and again,"},{"from":2718.97,"to":2722.78,"location":2,"content":"there are preceding systems that other people have developed and they"},{"from":2722.78,"to":2726.62,"location":2,"content":"are showing that by making use of this language model pre-training,"},{"from":2726.62,"to":2731.39,"location":2,"content":"they're able to significantly improve on the state of the art of these error rates,"},{"from":2731.39,"to":2733.9,"location":2,"content":"so that low is good."},{"from":2733.9,"to":2739.72,"location":2,"content":"They also showed another interesting result which is kind of,"},{"from":2739.72,"to":2744.39,"location":2,"content":"um, what you would expect or hope from doing this kind of transfer learning,"},{"from":2744.39,"to":2746.33,"location":2,"content":"that what they were able to show is,"},{"from":2746.33,"to":2751.2,"location":2,"content":"if you can train this neural language model on a big amount of data,"},{"from":2751.2,"to":2754.43,"location":2,"content":"that that means you will then be able to do well on"},{"from":2754.43,"to":2759.11,"location":2,"content":"your supervised task even when trained on pretty little data."},{"from":2759.11,"to":2761.78,"location":2,"content":"Um, so, here this is error rate,"},{"from":2761.78,"to":2763.36,"location":2,"content":"so low is good."},{"from":2763.36,"to":2765.17,"location":2,"content":"So, what the- and here's the number of"},{"from":2765.17,"to":2768.82,"location":2,"content":"training examples which has being done on a log scale."},{"from":2768.82,"to":2771.71,"location":2,"content":"And so the blue line is if you're just training"},{"from":2771.71,"to":2775.73,"location":2,"content":"a text classifier from scratch on supervised data."},{"from":2775.73,"to":2779.76,"location":2,"content":"So, you need a lot of data to start to do pretty well."},{"from":2779.76,"to":2784.72,"location":2,"content":"Um, but if you're making use of this transfer learning, um,"},{"from":2784.72,"to":2787.89,"location":2,"content":"from a pre-trained language model,"},{"from":2787.89,"to":2790.31,"location":2,"content":"you can get to that you're sort of doing pretty"},{"from":2790.31,"to":2793.7,"location":2,"content":"well with way less, um, training examples."},{"from":2793.7,"to":2795.89,"location":2,"content":"Essentially, an order of magnitude,"},{"from":2795.89,"to":2799.66,"location":2,"content":"less training examples will give you the same amount of performance."},{"from":2799.66,"to":2804.02,"location":2,"content":"And the difference between these two lines corresponds to the extra,"},{"from":2804.02,"to":2808.67,"location":2,"content":"um, phase that they had in the middle of theirs, um, which is,"},{"from":2808.67,"to":2813.92,"location":2,"content":"whether you're doing this sort of extra fine tuning on your target domain,"},{"from":2813.92,"to":2818.69,"location":2,"content":"um, it's part of your process and they found that to be pretty helpful."},{"from":2818.69,"to":2825.22,"location":2,"content":"Okay. So, that, um, is another precursor."},{"from":2825.22,"to":2831.55,"location":2,"content":"Um, and so, one big part of what has happened since then,"},{"from":2831.55,"to":2835.82,"location":2,"content":"is effectively people said this is a good idea, uh,"},{"from":2835.82,"to":2841.91,"location":2,"content":"maybe it'll become a really really good idea if we just make things way bigger."},{"from":2841.91,"to":2844.25,"location":2,"content":"Um, so, ULMfit, um,"},{"from":2844.25,"to":2848.05,"location":2,"content":"was something that you could train in one GPU day,"},{"from":2848.05,"to":2851.87,"location":2,"content":"sounds appealing for CS224N final projects,"},{"from":2851.87,"to":2854.93,"location":2,"content":"remember that, um, and but well,"},{"from":2854.93,"to":2859.11,"location":2,"content":"then the people at OpenAI decided, well,"},{"from":2859.11,"to":2863.3,"location":2,"content":"we could build a pretrain language model and train it on"},{"from":2863.3,"to":2867.59,"location":2,"content":"a much larger amount of data on a much larger amount of compute,"},{"from":2867.59,"to":2874.13,"location":2,"content":"and use about 242 GPU days and that will get a lot better, and it did."},{"from":2874.13,"to":2877.19,"location":2,"content":"Um, and then the people at Google said,"},{"from":2877.19,"to":2880.45,"location":2,"content":"well we could train a model, um,"},{"from":2880.45,"to":2884.66,"location":2,"content":"in to 256 TPU days,"},{"from":2884.66,"to":2887.64,"location":2,"content":"which means maybe about double the amount of computation."},{"from":2887.64,"to":2889.57,"location":2,"content":"It's hard to figure out exactly,"},{"from":2889.57,"to":2892.18,"location":2,"content":"and that might be able to do exciting things,"},{"from":2892.18,"to":2894.95,"location":2,"content":"and that was the BERT model, and it did."},{"from":2894.95,"to":2898.37,"location":2,"content":"Um, and then if you're following along these things, um,"},{"from":2898.37,"to":2900.11,"location":2,"content":"just last week, um,"},{"from":2900.11,"to":2902.27,"location":2,"content":"the OpenAI people said,"},{"from":2902.27,"to":2906.84,"location":2,"content":"well we can go much bigger again and we can train a model, um,"},{"from":2906.84,"to":2912.83,"location":2,"content":"for approximately 2,000 TPU version three days."},{"from":2912.83,"to":2916.34,"location":2,"content":"Um, and it will be able to,"},{"from":2916.34,"to":2919.29,"location":2,"content":"um, do much bigger again,"},{"from":2919.29,"to":2921.08,"location":2,"content":"a bit much better again,"},{"from":2921.08,"to":2924.41,"location":2,"content":"um, and so, this is this GP2,"},{"from":2924.41,"to":2927.8,"location":2,"content":"GPT-2 language model, um,"},{"from":2927.8,"to":2930.68,"location":2,"content":"which OpenAI released last week."},{"from":2930.68,"to":2936.74,"location":2,"content":"Um, and they're, they're actually very impressive results, um,"},{"from":2936.74,"to":2940.73,"location":2,"content":"when they're showing that if you're sort of building a really,"},{"from":2940.73,"to":2945.16,"location":2,"content":"really huge language model over a very large amount of data."},{"from":2945.16,"to":2949.74,"location":2,"content":"And then you say language model go off and generate some text,"},{"from":2949.74,"to":2951.8,"location":2,"content":"on this particular topic,"},{"from":2951.8,"to":2955.1,"location":2,"content":"that it can actually just do a great job of producing text."},{"from":2955.1,"to":2957.13,"location":2,"content":"So, the way this was being do- done,"},{"from":2957.13,"to":2959.93,"location":2,"content":"was a humanist writing a couple of sentences;"},{"from":2959.93,"to":2961.19,"location":2,"content":"in a shocking finding,"},{"from":2961.19,"to":2963.51,"location":2,"content":"scientists discovered a herd of unicorns,"},{"from":2963.51,"to":2967.7,"location":2,"content":"living in remote previously unexplored valley in the Andes Mountains."},{"from":2967.7,"to":2969.91,"location":2,"content":"Um, and so, we then,"},{"from":2969.91,"to":2973.7,"location":2,"content":"using our neural language model and chugging through that,"},{"from":2973.7,"to":2975.68,"location":2,"content":"so that gives us context,"},{"from":2975.68,"to":2977.76,"location":2,"content":"and then say generate more text,"},{"from":2977.76,"to":2979.76,"location":2,"content":"and it starts to generate the scientist"},{"from":2979.76,"to":2982.16,"location":2,"content":"named the population after their distinctive horn,"},{"from":2982.16,"to":2984.32,"location":2,"content":"Ovid's Unicorn, these four-horned,"},{"from":2984.32,"to":2987.82,"location":2,"content":"silver-white Uni four corns were previously unknown to science."},{"from":2987.82,"to":2990.08,"location":2,"content":"Um, it produces remarkably,"},{"from":2990.08,"to":2992.74,"location":2,"content":"um, good text or at least in the,"},{"from":2992.74,"to":2997.22,"location":2,"content":"in the hand-picked examples [LAUGHTER] that they showed in the tech news,"},{"from":2997.22,"to":2999.92,"location":2,"content":"um, it produces extremely good text."},{"from":2999.92,"to":3004.96,"location":2,"content":"Um, yeah so, I think one should be a little bit cautious about, um,"},{"from":3004.96,"to":3007.93,"location":2,"content":"that and sort of some of its random outputs actually"},{"from":3007.93,"to":3010.9,"location":2,"content":"aren't nearly as good but nevertheless you know,"},{"from":3010.9,"to":3012.89,"location":2,"content":"I think is is actually dramatic"},{"from":3012.89,"to":3016.54,"location":2,"content":"how good language models are becoming once you are training"},{"from":3016.54,"to":3023.28,"location":2,"content":"them on long contexts as we can do with modern models on vast amounts of data, um-."},{"from":3023.28,"to":3027.43,"location":2,"content":"So then, um, the OpenAI people decided"},{"from":3027.43,"to":3031.72,"location":2,"content":"this language model was so good that they weren't gonna release it to the world, um,"},{"from":3031.72,"to":3034.48,"location":2,"content":"which then got transformed into headlines of,"},{"from":3034.48,"to":3039.26,"location":2,"content":"Elon Musk's OpenAI builds artificial intelligence so powerful,"},{"from":3039.26,"to":3041.98,"location":2,"content":"it must be kept locked up for the good of humanity."},{"from":3041.98,"to":3046.66,"location":2,"content":"[LAUGHTER] Um, with the suitable pictures that always turn off at"},{"from":3046.66,"to":3052.07,"location":2,"content":"these moments down the bottom of the screen, um, and,"},{"from":3052.07,"to":3057.52,"location":2,"content":"um, yeah I guess that was the leading even Elon Musk to be wanting to clarify and say"},{"from":3057.52,"to":3063.02,"location":2,"content":"that it's not actually really that he's directing what's happening at OpenAI anymore."},{"from":3063.02,"to":3066.36,"location":2,"content":"Um, anyway, moving right along."},{"from":3066.36,"to":3069.76,"location":2,"content":"Um, so, part of the story here is"},{"from":3069.76,"to":3074.64,"location":2,"content":"just a scaling thing that these things have been getting bigger and bigger,"},{"from":3074.64,"to":3078.76,"location":2,"content":"um, but the other part of the story is that all three of"},{"from":3078.76,"to":3083.78,"location":2,"content":"these are then systems that use the transformer architecture."},{"from":3083.78,"to":3087.7,"location":2,"content":"And transformer architectures have not only being very powerful,"},{"from":3087.7,"to":3092.57,"location":2,"content":"but technically had allowed scaling to much bigger sizes."},{"from":3092.57,"to":3095.57,"location":2,"content":"So to understand some of the rest of these, um,"},{"from":3095.57,"to":3099.05,"location":2,"content":"we should learn more about transformers."},{"from":3099.05,"to":3102.61,"location":2,"content":"And so, I'm sort of gonna do that, um,"},{"from":3102.61,"to":3106.49,"location":2,"content":"but I mean, um, in mix of orders,"},{"from":3106.49,"to":3110.2,"location":2,"content":"um, our invited speaker coming Thursday uh, is, um,"},{"from":3110.2,"to":3112.42,"location":2,"content":"one of the authors of the transformer paper,"},{"from":3112.42,"to":3114.49,"location":2,"content":"and he's gonna talk about transformers."},{"from":3114.49,"to":3117.43,"location":2,"content":"So I think what I'm gonna do is, um,"},{"from":3117.43,"to":3121,"location":2,"content":"say a little bit about transformers quickly,"},{"from":3121,"to":3124.09,"location":2,"content":"but not really dwell on all the details, um,"},{"from":3124.09,"to":3126.26,"location":2,"content":"but hope that it's a bit of an introduction,"},{"from":3126.26,"to":3130.36,"location":2,"content":"and you can find out more on Thursday about the details and"},{"from":3130.36,"to":3135.19,"location":2,"content":"then talk some more about the BERT model before finishing."},{"from":3135.19,"to":3139.45,"location":2,"content":"So the motivation for transformers is essentially"},{"from":3139.45,"to":3143.24,"location":2,"content":"we want things to go faster so we can build bigger models,"},{"from":3143.24,"to":3146.13,"location":2,"content":"and the problem as we mentioned for these, um,"},{"from":3146.13,"to":3151.06,"location":2,"content":"LSTM or in general any of the recurrent models is the fact that they're recurrent."},{"from":3151.06,"to":3156.19,"location":2,"content":"You have to generate sort of one to n status time chugging through,"},{"from":3156.19,"to":3161.28,"location":2,"content":"and that means you just can't do the same kind of parallel computation, um,"},{"from":3161.28,"to":3166.97,"location":2,"content":"that GPUs love that you can do in things like convolutional neural networks."},{"from":3166.97,"to":3168.86,"location":2,"content":"But, you know, on the other hand,"},{"from":3168.86,"to":3171.21,"location":2,"content":"we discovered that even though, um,"},{"from":3171.21,"to":3176.01,"location":2,"content":"these gated recurrent units like LSTMs and GRUs are great,"},{"from":3176.01,"to":3180.07,"location":2,"content":"that to get really great performance out of these recurrent models,"},{"from":3180.07,"to":3185.68,"location":2,"content":"we found that we wanted to- we had a problem within these long sequence lengths,"},{"from":3185.68,"to":3189.01,"location":2,"content":"and we can improve things by adding attention mechanisms."},{"from":3189.01,"to":3192.07,"location":2,"content":"And so that led to the idea of- well,"},{"from":3192.07,"to":3194.43,"location":2,"content":"since attention works so great,"},{"from":3194.43,"to":3197.44,"location":2,"content":"maybe we can just use attention,"},{"from":3197.44,"to":3202.2,"location":2,"content":"and we can actually get rid of the recurrent part of the model [NOISE] altogether."},{"from":3202.2,"to":3207.63,"location":2,"content":"And so that actually then leads to the idea of these transformer architectures,"},{"from":3207.63,"to":3212.55,"location":2,"content":"and the original paper on this is actually called attention is all you need,"},{"from":3212.55,"to":3216.7,"location":2,"content":"which reflects this idea of we're gonna keep the attention part,"},{"from":3216.7,"to":3220,"location":2,"content":"and we're getting- going to get rid of the, um,"},{"from":3220,"to":3223.96,"location":2,"content":"recurrent part, and we'll be able to build a great model."},{"from":3223.96,"to":3225.31,"location":2,"content":"So in the initial work,"},{"from":3225.31,"to":3228.79,"location":2,"content":"what they're doing is machine translation kind of like"},{"from":3228.79,"to":3232.72,"location":2,"content":"the Neural Machine Translation with attention we described,"},{"from":3232.72,"to":3236.18,"location":2,"content":"but what they're wanting to do is build"},{"from":3236.18,"to":3243.63,"location":2,"content":"a complex encoder and a complex decoder that works non-recurrently,"},{"from":3243.63,"to":3247.66,"location":2,"content":"and, um, nevertheless is able to translate sentences"},{"from":3247.66,"to":3253.07,"location":2,"content":"well by making use of lots of attention distributions."},{"from":3253.07,"to":3258.07,"location":2,"content":"And so, I wanted to say a little bit more quickly about that,"},{"from":3258.07,"to":3260.97,"location":2,"content":"and hopefully we'll get more of this on Thursday."},{"from":3260.97,"to":3264.68,"location":2,"content":"Um, first as a- as a recommended resource,"},{"from":3264.68,"to":3266.55,"location":2,"content":"if you wanna look at, um,"},{"from":3266.55,"to":3269.7,"location":2,"content":"home and learn more about, um,"},{"from":3269.7,"to":3274,"location":2,"content":"the transformer architecture, there's this really great, um,"},{"from":3274,"to":3279.1,"location":2,"content":"bit of work by Sasha Rush called The Annotated Transformer that goes through"},{"from":3279.1,"to":3285.03,"location":2,"content":"the entire transformer paper accompanied by PyTorch code in a Jupyter Notebook,"},{"from":3285.03,"to":3288.22,"location":2,"content":"and so that can actually be a really useful thing,"},{"from":3288.22,"to":3294.24,"location":2,"content":"but I'll go through a little bit of the basics now of how we do things."},{"from":3294.24,"to":3297.46,"location":2,"content":"So the basic idea, um,"},{"from":3297.46,"to":3303.39,"location":2,"content":"is that they're going to use attention everywhere to calculate things."},{"from":3303.39,"to":3307.54,"location":2,"content":"And, um, we talked before about the different kinds of"},{"from":3307.54,"to":3312.52,"location":2,"content":"attention of the sort of multiplicative by linear attention and the little,"},{"from":3312.52,"to":3315.49,"location":2,"content":"um, feed-forward network additive attention."},{"from":3315.49,"to":3318.67,"location":2,"content":"They kind of go for the simplest kind of attention,"},{"from":3318.67,"to":3323.03,"location":2,"content":"where the attention is just dot-products between two things."},{"from":3323.03,"to":3326.86,"location":2,"content":"Um, but they sort of do the more comp- for various purposes,"},{"from":3326.86,"to":3332.83,"location":2,"content":"they do the more complicated version of dot-product between two things where they have,"},{"from":3332.83,"to":3336.28,"location":2,"content":"um, when the- the things that they're looking up are"},{"from":3336.28,"to":3340.38,"location":2,"content":"assumed to be key-value pairs, keys and values,"},{"from":3340.38,"to":3346.76,"location":2,"content":"and so you're calculating the similarity as a dot-product between a query and the key,"},{"from":3346.76,"to":3348.41,"location":2,"content":"and then based on that,"},{"from":3348.41,"to":3352.06,"location":2,"content":"you're going to be using the vector for the corresponding value."},{"from":3352.06,"to":3355.8,"location":2,"content":"So our equation here for what we're calculating is where you are"},{"from":3355.8,"to":3360.13,"location":2,"content":"looking using the softmax over query, um,"},{"from":3360.13,"to":3363.61,"location":2,"content":"key similarities and using that to give"},{"from":3363.61,"to":3368.68,"location":2,"content":"the weightings as an attention based weighting over the corresponding values."},{"from":3368.68,"to":3372.22,"location":2,"content":"Um, so that's the basic attention model."},{"from":3372.22,"to":3375.99,"location":2,"content":"Um, so that add- saying it that way, um,"},{"from":3375.99,"to":3378.1,"location":2,"content":"adds a little bit of complexity,"},{"from":3378.1,"to":3381.14,"location":2,"content":"but sort of for the simplest part for their encoder."},{"from":3381.14,"to":3386.07,"location":2,"content":"Actually, all of the query keys and values are exactly the same."},{"from":3386.07,"to":3388.22,"location":2,"content":"They are the words, um,"},{"from":3388.22,"to":3392.62,"location":2,"content":"that they're using as their source language, um, things."},{"from":3392.62,"to":3398.34,"location":2,"content":"So, it sort of adds some complexity that isn't really there."},{"from":3398.34,"to":3402.28,"location":2,"content":"Um, okay. Um, I'll skip that."},{"from":3402.28,"to":3408.18,"location":2,"content":"Um, so, there are a couple of other things that they do."},{"from":3408.18,"to":3412.16,"location":2,"content":"One thing that they note is that, um,"},{"from":3412.16,"to":3417.74,"location":2,"content":"the- the values you get from, um, QTK, um,"},{"from":3417.74,"to":3423.28,"location":2,"content":"very, in variances the dimension gets large"},{"from":3423.28,"to":3428.23,"location":2,"content":"so that they sort of do some normalization by the size of the hidden state dimension,"},{"from":3428.23,"to":3432.28,"location":2,"content":"but I'll leave that out as well for details, right."},{"from":3432.28,"to":3433.95,"location":2,"content":"So in the encoder, um,"},{"from":3433.95,"to":3437.02,"location":2,"content":"everything is just our word vectors,"},{"from":3437.02,"to":3440.38,"location":2,"content":"there are the queries, the keys, and the values."},{"from":3440.38,"to":3443.78,"location":2,"content":"Um, and we're gonna use attention everywhere in the system."},{"from":3443.78,"to":3449.86,"location":2,"content":"Oops. Okay. So the second new idea is, well,"},{"from":3449.86,"to":3456.11,"location":2,"content":"attention is great but maybe it's bad if you only have one attention distribution,"},{"from":3456.11,"to":3459.19,"location":2,"content":"because you're gonna only attend to things one way."},{"from":3459.19,"to":3462.41,"location":2,"content":"Maybe for various users it would be great"},{"from":3462.41,"to":3465.76,"location":2,"content":"if you could attend from one position to various things."},{"from":3465.76,"to":3471.19,"location":2,"content":"So, if you're thinking about syntax and what we did with dependency parsers."},{"from":3471.19,"to":3474.97,"location":2,"content":"If you're a word, you might want to attend to your headword,"},{"from":3474.97,"to":3479.16,"location":2,"content":"but you might also wanna attend- attend to your dependent words."},{"from":3479.16,"to":3481.69,"location":2,"content":"And if you happen to be a pronoun,"},{"from":3481.69,"to":3486.01,"location":2,"content":"you might want to attend to what the pronoun refers to you."},{"from":3486.01,"to":3487.86,"location":2,"content":"You might want to have lots of attention."},{"from":3487.86,"to":3492.01,"location":2,"content":"So they introduced this idea of multi-head attention."},{"from":3492.01,"to":3496.36,"location":2,"content":"And so what you're doing with multi-head attention is you have,"},{"from":3496.36,"to":3498.13,"location":2,"content":"um, your hidden states,"},{"from":3498.13,"to":3500.17,"location":2,"content":"um, in your system,"},{"from":3500.17,"to":3503.8,"location":2,"content":"and you map them via projection layers, um,"},{"from":3503.8,"to":3507.67,"location":2,"content":"which are just multiplications by different W matrices as"},{"from":3507.67,"to":3512.35,"location":2,"content":"linear projections into sort of different lower dimensional spaces,"},{"from":3512.35,"to":3517.03,"location":2,"content":"and then you use each of those to calculate dot-product attention,"},{"from":3517.03,"to":3520.27,"location":2,"content":"and so you can attend to different things at the same time."},{"from":3520.27,"to":3522.67,"location":2,"content":"And this multi-head attention was one of"},{"from":3522.67,"to":3528.66,"location":2,"content":"the very successful ideas of transformers that made them a more powerful architecture."},{"from":3528.66,"to":3534.72,"location":2,"content":"Okay. Um, so, then for our complete transformer block,"},{"from":3534.72,"to":3540.51,"location":2,"content":"it's sort of then starting to build complex architectures like we sort of started seeing,"},{"from":3540.51,"to":3542.2,"location":2,"content":"um, the other week."},{"from":3542.2,"to":3545.32,"location":2,"content":"Um, so- okay."},{"from":3545.32,"to":3546.97,"location":2,"content":"Yeah. So, starting,"},{"from":3546.97,"to":3550.06,"location":2,"content":"um, from our word vectors,"},{"from":3550.06,"to":3556.91,"location":2,"content":"we're kind of going to do attention to multiple different things,"},{"from":3556.91,"to":3559.9,"location":2,"content":"um, and we're simultaneously gonna have"},{"from":3559.9,"to":3563.53,"location":2,"content":"a residual connection that short-circuits around them."},{"from":3563.53,"to":3568.05,"location":2,"content":"Um, we're then going to sort of sum the two of these,"},{"from":3568.05,"to":3573.11,"location":2,"content":"and then they're going to do a normalization at that point."},{"from":3573.11,"to":3576.4,"location":2,"content":"Um, I talked previously about batch normalization,"},{"from":3576.4,"to":3578.02,"location":2,"content":"they don't do batch normalization,"},{"from":3578.02,"to":3581.2,"location":2,"content":"they do another variant which is layer normalization,"},{"from":3581.2,"to":3583.86,"location":2,"content":"which is a different way of doing normalization,"},{"from":3583.86,"to":3585.63,"location":2,"content":"but I'll skip that for now."},{"from":3585.63,"to":3589,"location":2,"content":"And then they sort of for one transformer block,"},{"from":3589,"to":3592.05,"location":2,"content":"you then go after the multi-head attention,"},{"from":3592.05,"to":3596.76,"location":2,"content":"you put things through a feed-forward layer which also has a residual connection,"},{"from":3596.76,"to":3598.81,"location":2,"content":"you sum the output of those,"},{"from":3598.81,"to":3603.79,"location":2,"content":"and you then again do another, um, layer normalization."},{"from":3603.79,"to":3608.97,"location":2,"content":"So this is the basic transformer block that they're gonna use everywhere."},{"from":3608.97,"to":3611.32,"location":2,"content":"And to make their complete architectures,"},{"from":3611.32,"to":3613.21,"location":2,"content":"they're then gonna sort of start stacking"},{"from":3613.21,"to":3617.05,"location":2,"content":"these transformer blocks to produce a very deep network."},{"from":3617.05,"to":3618.16,"location":2,"content":"And in some sense,"},{"from":3618.16,"to":3622.78,"location":2,"content":"what has been found is that transformers performed very well."},{"from":3622.78,"to":3625,"location":2,"content":"But, you know, there's no free lunch,"},{"from":3625,"to":3626.44,"location":2,"content":"um, you kind of can't."},{"from":3626.44,"to":3628.15,"location":2,"content":"You're- now, no longer getting"},{"from":3628.15,"to":3631.45,"location":2,"content":"recurrent information actually being carried along a sequence."},{"from":3631.45,"to":3636.28,"location":2,"content":"You've got a word at some position which can be casting attention,"},{"from":3636.28,"to":3638.03,"location":2,"content":"uh, on other words."},{"from":3638.03,"to":3641.56,"location":2,"content":"So if you'd like to have information carried along in a chain,"},{"from":3641.56,"to":3644.98,"location":2,"content":"you've sort of first of all gotta walk the first step of the chain,"},{"from":3644.98,"to":3646.69,"location":2,"content":"and then you need to have another layer"},{"from":3646.69,"to":3649.69,"location":2,"content":"vertically which can walk the next step of the chain,"},{"from":3649.69,"to":3653.8,"location":2,"content":"and then you need to have another layer vertically that walks the next step of the chain."},{"from":3653.8,"to":3657.52,"location":2,"content":"So, you're getting rid of the recurrence along the sequence,"},{"from":3657.52,"to":3663.22,"location":2,"content":"but you're substituting some depth to allow things to walk along multiple hops."},{"from":3663.22,"to":3667.89,"location":2,"content":"But nevertheless, that's highly advantageous in GPU architectures"},{"from":3667.89,"to":3673.3,"location":2,"content":"because it allows you to use parallelization to calculate everything at each,"},{"from":3673.3,"to":3679.29,"location":2,"content":"um, depth at the same time. Um."},{"from":3679.29,"to":3682.9,"location":2,"content":"Maybe I'll go light on explaining this as well."},{"from":3682.9,"to":3685.42,"location":2,"content":"Um, so they use byte-pair encodings."},{"from":3685.42,"to":3687.49,"location":2,"content":"But if you do nothing else,"},{"from":3687.49,"to":3690.85,"location":2,"content":"you just have words fed in this word vectors and you have"},{"from":3690.85,"to":3694.76,"location":2,"content":"no idea whether you're at the beginning of the sentence or at the end of the sentence."},{"from":3694.76,"to":3698.68,"location":2,"content":"Though, they have a message of- method of doing positional encoding which gives"},{"from":3698.68,"to":3702.86,"location":2,"content":"you some ideas to pro- position your word has in the sentence."},{"from":3702.86,"to":3707.95,"location":2,"content":"Okay. Um, so that's sort of the, um, encoder system."},{"from":3707.95,"to":3709.54,"location":2,"content":"So from the words,"},{"from":3709.54,"to":3711.55,"location":2,"content":"they have an initial word embedding,"},{"from":3711.55,"to":3714.09,"location":2,"content":"you add in their positional encoding,"},{"from":3714.09,"to":3718.11,"location":2,"content":"you go into one of these transformer blocks,"},{"from":3718.11,"to":3721.03,"location":2,"content":"and you then repeat it n times."},{"from":3721.03,"to":3723.84,"location":2,"content":"So you'll have a stack of these transformer blocks."},{"from":3723.84,"to":3726.78,"location":2,"content":"So you're multiple times doing, um,"},{"from":3726.78,"to":3731.59,"location":2,"content":"multi-head attention to other parts of the sentence, calculating values,"},{"from":3731.59,"to":3732.94,"location":2,"content":"feeding forward a value,"},{"from":3732.94,"to":3734.86,"location":2,"content":"putting it through a fully-connected layer,"},{"from":3734.86,"to":3739.74,"location":2,"content":"and then you just sort of repeat, do attention to different places in the sentence."},{"from":3739.74,"to":3741.31,"location":2,"content":"Get all your information,"},{"from":3741.31,"to":3743.28,"location":2,"content":"put it through a fully connected layer,"},{"from":3743.28,"to":3746.76,"location":2,"content":"and go up, um, proceeding up deeply."},{"from":3746.76,"to":3751,"location":2,"content":"And and that sounds a little mysterious,"},{"from":3751,"to":3754.22,"location":2,"content":"but it turns out to work just great."},{"from":3754.22,"to":3756.6,"location":2,"content":"And the way to think about,"},{"from":3756.6,"to":3759.9,"location":2,"content":"I think is that at each stage,"},{"from":3759.9,"to":3764.76,"location":2,"content":"you can look with your multi-headed attention and various other places in the sentence,"},{"from":3764.76,"to":3768.21,"location":2,"content":"accumulate information, push it up to the next layer."},{"from":3768.21,"to":3771.26,"location":2,"content":"And if you do that sort of half a dozen times,"},{"from":3771.26,"to":3775.53,"location":2,"content":"you can be starting to progressively push information along"},{"from":3775.53,"to":3781.45,"location":2,"content":"the sequence in either direction to calculate values that are of interest."},{"from":3781.45,"to":3788.61,"location":2,"content":"Um, and the interesting thing is that these models turn out to work"},{"from":3788.61,"to":3795.97,"location":2,"content":"really well at sort of learning to attend the interesting things in linguistic structure."},{"from":3795.97,"to":3799.81,"location":2,"content":"Um, so these are just sort of suggestive diagrams,"},{"from":3799.81,"to":3804.19,"location":2,"content":"but this is looking at layer five of the transformer stack and"},{"from":3804.19,"to":3808.95,"location":2,"content":"seeing what words are being attended to by different attention heads."},{"from":3808.95,"to":3813.01,"location":2,"content":"So these different colors correspond to different attention heads."},{"from":3813.01,"to":3815.05,"location":2,"content":"And so the sentence is,"},{"from":3815.05,"to":3819.01,"location":2,"content":"um, it is, \"In this spirit,"},{"from":3819.01,"to":3822.31,"location":2,"content":"that a majority of American governments have passed new laws since"},{"from":3822.31,"to":3827.06,"location":2,"content":"2009 making the registration or voting process more difficult.\""},{"from":3827.06,"to":3833.28,"location":2,"content":"And so what we see is sort of most of the attention heads,"},{"from":3833.28,"to":3838.84,"location":2,"content":"uh, looking from making to making more difficult and that seems to be useful."},{"from":3838.84,"to":3843.7,"location":2,"content":"One of the attention heads seems to be looking at the word itself might be okay."},{"from":3843.7,"to":3850.57,"location":2,"content":"Um, then the other ones are sort of looking a bit at laws and at 2009."},{"from":3850.57,"to":3854.53,"location":2,"content":"So it's sort of picking out the arguments, um,"},{"from":3854.53,"to":3858.91,"location":2,"content":"and modifiers and making in a syntax kind of like way."},{"from":3858.91,"to":3861.88,"location":2,"content":"Um, interestingly, for pronouns,"},{"from":3861.88,"to":3866.77,"location":2,"content":"attention heads appear to learn to be able to look back to reference."},{"from":3866.77,"to":3868.8,"location":2,"content":"So the law will never be perfect,"},{"from":3868.8,"to":3875.18,"location":2,"content":"but its application should be just that one attention head it for its,"},{"from":3875.18,"to":3879.05,"location":2,"content":"is looking at what its is modifying in the application."},{"from":3879.05,"to":3880.93,"location":2,"content":"But another attention head,"},{"from":3880.93,"to":3885.64,"location":2,"content":"the its is looking strongly at what its refers back to as the law."},{"from":3885.64,"to":3887.74,"location":2,"content":"So that seems kind of cool."},{"from":3887.74,"to":3889.81,"location":2,"content":"Um, yeah."},{"from":3889.81,"to":3892.87,"location":2,"content":"Um, okay."},{"from":3892.87,"to":3896.03,"location":2,"content":"And so then, for the rest of the model, um,"},{"from":3896.03,"to":3898.99,"location":2,"content":"there's then some more complexity for how to use"},{"from":3898.99,"to":3905.02,"location":2,"content":"the transformers decoder to give you a full neural machine translation system."},{"from":3905.02,"to":3908.77,"location":2,"content":"But I think maybe I will skip that and go"},{"from":3908.77,"to":3913.75,"location":2,"content":"on and say a bit about BERT in my remaining minutes."},{"from":3913.75,"to":3918.49,"location":2,"content":"Okay. So, um, the latest and greatest contextual"},{"from":3918.49,"to":3923.59,"location":2,"content":"word representations to help you flow your tasks have been these BERT vectors,"},{"from":3923.59,"to":3929.97,"location":2,"content":"where BERT is Bidirectional Encoder Representations from Transformers."},{"from":3929.97,"to":3935.09,"location":2,"content":"And so essentially, it's using the encoder from a transformer network."},{"from":3935.09,"to":3940.2,"location":2,"content":"Uh, this deep multi-headed attention stack to calculate, um,"},{"from":3940.2,"to":3943.61,"location":2,"content":"a representation of a sentence and saying,"},{"from":3943.61,"to":3949.75,"location":2,"content":"\"That's a great all-purpose representation of a sentence that you can use for tasks."},{"from":3949.75,"to":3954.05,"location":2,"content":"Be it named entity recognition or SQuAD question answering.\""},{"from":3954.05,"to":3959.32,"location":2,"content":"And so there's actually an interesting new idea that these people had."},{"from":3959.32,"to":3964.99,"location":2,"content":"And that well, their idea was well standard language models are"},{"from":3964.99,"to":3968.23,"location":2,"content":"unidirectional and that's useful"},{"from":3968.23,"to":3971.76,"location":2,"content":"because it gives you a probability distribution of a language model."},{"from":3971.76,"to":3976.21,"location":2,"content":"But it's bad because you'd like to be able to do"},{"from":3976.21,"to":3981.19,"location":2,"content":"prediction from both sides to understand word meaning and context."},{"from":3981.19,"to":3983.72,"location":2,"content":"There's a second choice, um,"},{"from":3983.72,"to":3989.18,"location":2,"content":"which is you can kind of do bidirectional models when you incorporate,"},{"from":3989.18,"to":3991.7,"location":2,"content":"um, information in both ways."},{"from":3991.7,"to":3995.05,"location":2,"content":"But that sort of has problems as well,"},{"from":3995.05,"to":3997.48,"location":2,"content":"because then you get crosstalk."},{"from":3997.48,"to":4000.61,"location":2,"content":"Um, and so if you run a BiLSTM,"},{"from":4000.61,"to":4003.09,"location":2,"content":"and then you merge the representations by"},{"from":4003.09,"to":4006.76,"location":2,"content":"concatenation and then feed them into the next layer."},{"from":4006.76,"to":4008.66,"location":2,"content":"When you're running the next layer,"},{"from":4008.66,"to":4011.43,"location":2,"content":"the forward LSTM will have already gotten"},{"from":4011.43,"to":4014.39,"location":2,"content":"information about the future from the first layer."},{"from":4014.39,"to":4016.55,"location":2,"content":"Um, so it sort of, um,"},{"from":4016.55,"to":4020.49,"location":2,"content":"ends up with words that have already seen the future themselves."},{"from":4020.49,"to":4023.68,"location":2,"content":"So you have this sort of complex non-generative model."},{"from":4023.68,"to":4028.01,"location":2,"content":"Um, so somehow, they wanted to do things a bit differently,"},{"from":4028.01,"to":4033.6,"location":2,"content":"so they can have bidirectional context without words being able to see themselves."},{"from":4033.6,"to":4036.91,"location":2,"content":"And the idea that they came up with is well,"},{"from":4036.91,"to":4041.43,"location":2,"content":"we're gonna train things with a transformer encoder."},{"from":4041.43,"to":4046.51,"location":2,"content":"But what we're gonna do is mask out some of the words in the sentence,"},{"from":4046.51,"to":4050.16,"location":2,"content":"like, maybe we'll mask here store and gallon."},{"from":4050.16,"to":4054.18,"location":2,"content":"And then, so our language mod- our language modelling like"},{"from":4054.18,"to":4056.13,"location":2,"content":"objective will no longer be"},{"from":4056.13,"to":4060.09,"location":2,"content":"a true language model that's sort of generating a probability of a sentence,"},{"from":4060.09,"to":4063.7,"location":2,"content":"um, which is standardly done by working from left to right,"},{"from":4063.7,"to":4069.39,"location":2,"content":"but it will instead be a Mad Libs style fill in the blank objective."},{"from":4069.39,"to":4072.12,"location":2,"content":"So you'll see this context,"},{"from":4072.12,"to":4073.8,"location":2,"content":"which will be literally,"},{"from":4073.8,"to":4076.97,"location":2,"content":"\"The man went to the mask to buy a mask of milk.\""},{"from":4076.97,"to":4080.79,"location":2,"content":"And your, what's your training objective is to say,"},{"from":4080.79,"to":4083.43,"location":2,"content":"try and predict what this word is,"},{"from":4083.43,"to":4088.03,"location":2,"content":"which you can do with a cross entropy loss to the extent that you don't guess store."},{"from":4088.03,"to":4092.88,"location":2,"content":"And then, it will be trying to guess what this word is and you want to let guess gallon."},{"from":4092.88,"to":4094.99,"location":2,"content":"So you're training a model,"},{"from":4094.99,"to":4097.92,"location":2,"content":"um, to fill in these blanks."},{"from":4097.92,"to":4102.84,"location":2,"content":"Um, and the rate at which they blank words is essentially one word in seven,"},{"from":4102.84,"to":4105.23,"location":2,"content":"and they discuss how this is a trade-off."},{"from":4105.23,"to":4108.54,"location":2,"content":"Because if you blank too few words,"},{"from":4108.54,"to":4110.7,"location":2,"content":"it gets very expensive to train."},{"from":4110.7,"to":4112.59,"location":2,"content":"And if you blank many words,"},{"from":4112.59,"to":4115.55,"location":2,"content":"well you've blanked out most of the context of a word,"},{"from":4115.55,"to":4118.06,"location":2,"content":"and that means it's not very useful for training,"},{"from":4118.06,"to":4122.32,"location":2,"content":"and they found about sort of one in seven seemed to work pretty well for them."},{"from":4122.32,"to":4126.59,"location":2,"content":"But what they want to argue is, um,"},{"from":4126.59,"to":4131.22,"location":2,"content":"that for the OpenAI's GPT,"},{"from":4131.22,"to":4133.47,"location":2,"content":"which is also a transformer model."},{"from":4133.47,"to":4136.85,"location":2,"content":"It's a sort of a classic language model working from"},{"from":4136.85,"to":4140.7,"location":2,"content":"left to right and so you only get left context."},{"from":4140.7,"to":4143.81,"location":2,"content":"Um, for the BERT language model,"},{"from":4143.81,"to":4147.28,"location":2,"content":"sorry, the ELMo language model that's shown up at the top."},{"from":4147.28,"to":4151.68,"location":2,"content":"Um, well, they're running a left to right language model and they're running,"},{"from":4151.68,"to":4153.99,"location":2,"content":"um, right to left language models."},{"from":4153.99,"to":4156.03,"location":2,"content":"So in some sense, um,"},{"from":4156.03,"to":4158.3,"location":2,"content":"they have context from both sides."},{"from":4158.3,"to":4162.69,"location":2,"content":"But these two language models are trained completely independently"},{"from":4162.69,"to":4167.27,"location":2,"content":"and then you're just sort of concatenating their representations, um, together."},{"from":4167.27,"to":4172.17,"location":2,"content":"So there's no sense in which we're actually kind of having a model that's jointly"},{"from":4172.17,"to":4177.93,"location":2,"content":"using context from both sides at the time though that the pre-trained,"},{"from":4177.93,"to":4180.93,"location":2,"content":"um, contextual word representations are built."},{"from":4180.93,"to":4185.94,"location":2,"content":"So their hope is using inside a transformer model"},{"from":4185.94,"to":4187.98,"location":2,"content":"this trick of blanking out words,"},{"from":4187.98,"to":4193.29,"location":2,"content":"and predicting it using the entire context will allow them to use two-sided context,"},{"from":4193.29,"to":4195.54,"location":2,"content":"and be much more effective."},{"from":4195.54,"to":4200.02,"location":2,"content":"And that's what they seem to show, um."},{"from":4200.02,"to":4203.84,"location":2,"content":"There's one other complication and,"},{"from":4203.84,"to":4205.48,"location":2,"content":"I mean, I'll show later."},{"from":4205.48,"to":4209.84,"location":2,"content":"Um, this last complication is a bit useful,"},{"from":4209.84,"to":4213,"location":2,"content":"but it's sort of not really essential to their main idea,"},{"from":4213,"to":4214.85,"location":2,"content":"was that they thought,"},{"from":4214.85,"to":4218.55,"location":2,"content":"one of the, one of the goals in their head was clearly to be able to"},{"from":4218.55,"to":4222.66,"location":2,"content":"have this be useful for things like question answering,"},{"from":4222.66,"to":4225.08,"location":2,"content":"um, tasks, or, um,"},{"from":4225.08,"to":4226.77,"location":2,"content":"natural language inference tasks,"},{"from":4226.77,"to":4230.64,"location":2,"content":"and their relationships between, um, two sentences."},{"from":4230.64,"to":4232.26,"location":2,"content":"So, their idea was, well,"},{"from":4232.26,"to":4236.43,"location":2,"content":"one good objective is this fill in the blank word objective which is,"},{"from":4236.43,"to":4239.09,"location":2,"content":"sort of, like language modeling objective."},{"from":4239.09,"to":4242.31,"location":2,"content":"But they thought it would be useful to have a second objective"},{"from":4242.31,"to":4245.93,"location":2,"content":"where you're predicting relationships between sentences."},{"from":4245.93,"to":4251.41,"location":2,"content":"So, they secondly have a loss function which is, um,"},{"from":4251.41,"to":4254.67,"location":2,"content":"let's have two sentences where"},{"from":4254.67,"to":4258.36,"location":2,"content":"the sentences might be two successive sentences in the text,"},{"from":4258.36,"to":4262.65,"location":2,"content":"or a sentence followed by a random sentence from somewhere else."},{"from":4262.65,"to":4266.48,"location":2,"content":"And we want to train the system to predict when you've,"},{"from":4266.48,"to":4270.93,"location":2,"content":"seeing an- a correct next sentence versus a random sentence."},{"from":4270.93,"to":4276.33,"location":2,"content":"And so you're also training a loss based on this next sentence prediction task."},{"from":4276.33,"to":4279.66,"location":2,"content":"And so it'll be something like: The man went to the store."},{"from":4279.66,"to":4281.43,"location":2,"content":"He bought a gallon of milk."},{"from":4281.43,"to":4284.61,"location":2,"content":"You're meant to predict true is the next sentence,"},{"from":4284.61,"to":4286.74,"location":2,"content":"um: The man went to the store."},{"from":4286.74,"to":4288.09,"location":2,"content":"Penguins are flightless."},{"from":4288.09,"to":4289.52,"location":2,"content":"You're meant to say false."},{"from":4289.52,"to":4291.28,"location":2,"content":"This isn't the next sentence."},{"from":4291.28,"to":4293.58,"location":2,"content":"And so they're simultaneously also,"},{"from":4293.58,"to":4296.32,"location":2,"content":"um, training with this representation."},{"from":4296.32,"to":4300.35,"location":2,"content":"So, what they end up looks, looks like this."},{"from":4300.35,"to":4304.24,"location":2,"content":"Um, so, they have,"},{"from":4304.24,"to":4305.49,"location":2,"content":"um, for the input,"},{"from":4305.49,"to":4307.17,"location":2,"content":"they'll have a pair of sentences."},{"from":4307.17,"to":4308.7,"location":2,"content":"My dog is cute."},{"from":4308.7,"to":4310.1,"location":2,"content":"Um, separator."},{"from":4310.1,"to":4311.93,"location":2,"content":"He likes playing."},{"from":4311.93,"to":4317.95,"location":2,"content":"Um, the words are represented as word pieces like we talked about last week."},{"from":4317.95,"to":4321.57,"location":2,"content":"Um, so there's a token embedding for each word piece."},{"from":4321.57,"to":4325.35,"location":2,"content":"Um, then there's a positional embedding for"},{"from":4325.35,"to":4329.53,"location":2,"content":"each word piece which is gonna be summed with the token embedding."},{"from":4329.53,"to":4334.47,"location":2,"content":"And then finally, there's a segment embedding for each word piece which is simply"},{"from":4334.47,"to":4337.05,"location":2,"content":"whether it comes from the first sentence or"},{"from":4337.05,"to":4339.91,"location":2,"content":"the second sentence before or after the separator."},{"from":4339.91,"to":4344.94,"location":2,"content":"So, you're summing those three things together to get the token representations."},{"from":4344.94,"to":4348.91,"location":2,"content":"And then you're going to use those in a transformer model"},{"from":4348.91,"to":4353.84,"location":2,"content":"where you will have losses to the extent that you can't predict the masked words."},{"from":4353.84,"to":4358.41,"location":2,"content":"And then your binary prediction function as to whether there's"},{"from":4358.41,"to":4363.52,"location":2,"content":"a correct next sentence or not which is the training architecture."},{"from":4363.52,"to":4367.48,"location":2,"content":"Okay. So, it's a transformer as before,"},{"from":4367.48,"to":4370.74,"location":2,"content":"it's trained on Wikipedia plus the BookCorpus."},{"from":4370.74,"to":4372.72,"location":2,"content":"And they built two models."},{"from":4372.72,"to":4377.18,"location":2,"content":"Um, the Base-BERT model was a twelve layer transformer."},{"from":4377.18,"to":4382.47,"location":2,"content":"And so this corresponded to what the previous transformer paper had used, right?"},{"from":4382.47,"to":4389.19,"location":2,"content":"Those two layer transformer blocks repeated six times gave you 12 layers with 768 hidden,"},{"from":4389.19,"to":4394.66,"location":2,"content":"um, dimension hidden states and 12 heads for the multi-head attention."},{"from":4394.66,"to":4396.48,"location":2,"content":"And then they went bigger,"},{"from":4396.48,"to":4398.61,"location":2,"content":"um, and trained BERT-Large which is,"},{"from":4398.61,"to":4400.62,"location":2,"content":"sort of, double the number of layers,"},{"from":4400.62,"to":4403.48,"location":2,"content":"bigger hidden states, even more attention heads."},{"from":4403.48,"to":4406.41,"location":2,"content":"Um, and training these on,"},{"from":4406.41,"to":4409.19,"location":2,"content":"um, pods of TPUs."},{"from":4409.19,"to":4413.85,"location":2,"content":"Um, so, first of all, you're training, um,"},{"from":4413.85,"to":4418.26,"location":2,"content":"on this basis for masked words and,"},{"from":4418.26,"to":4420.38,"location":2,"content":"um, next sentence or not."},{"from":4420.38,"to":4425.94,"location":2,"content":"Um, so then what they wanted to say was this pre-trained model,"},{"from":4425.94,"to":4431.69,"location":2,"content":"um, evaluated on these losses and masked language model and next sentence prediction."},{"from":4431.69,"to":4434.93,"location":2,"content":"Um, we could then take this model,"},{"from":4434.93,"to":4439.05,"location":2,"content":"fr- freeze most of its what weak. No, sorry, that's wrong."},{"from":4439.05,"to":4441.27,"location":2,"content":"We could take this model, um,"},{"from":4441.27,"to":4446.61,"location":2,"content":"pre-trained and it would be incredibly useful for various different tasks."},{"from":4446.61,"to":4448.8,"location":2,"content":"We could use it for named entity recognition,"},{"from":4448.8,"to":4452.31,"location":2,"content":"question answering, natural language inference et cetera."},{"from":4452.31,"to":4454.89,"location":2,"content":"And the way we're going to do it, is kind of,"},{"from":4454.89,"to":4458.55,"location":2,"content":"doing the same thing as the ULMFit model did."},{"from":4458.55,"to":4460.76,"location":2,"content":"We're not just going to say here's our,"},{"from":4460.76,"to":4465.24,"location":2,"content":"here's a contextual word representation like ELMo did."},{"from":4465.24,"to":4469.56,"location":2,"content":"Instead, what we're gonna say is just keep on using this,"},{"from":4469.56,"to":4472.23,"location":2,"content":"keep on using this um,"},{"from":4472.23,"to":4476.88,"location":2,"content":"transformer network that we trained as a, sort of,"},{"from":4476.88,"to":4482.53,"location":2,"content":"language model, but fine tune it for a particular task."},{"from":4482.53,"to":4485.19,"location":2,"content":"So, you're now going to run this transformer"},{"from":4485.19,"to":4489.18,"location":2,"content":"calculating representations for a particular task."},{"from":4489.18,"to":4495.99,"location":2,"content":"And what we're going to change is we're going to remove the very top-level prediction."},{"from":4495.99,"to":4500.41,"location":2,"content":"The bits that predict the mass language model and next sentence prediction."},{"from":4500.41,"to":4502.77,"location":2,"content":"And we're going to substitute on it,"},{"from":4502.77,"to":4508.08,"location":2,"content":"on top, um, a final prediction layer that's appropriate for the task."},{"from":4508.08,"to":4511.01,"location":2,"content":"So, if our task is SQuAD question answering,"},{"from":4511.01,"to":4516.34,"location":2,"content":"our final prediction layer will be predicting start of span and end of span,"},{"from":4516.34,"to":4520.74,"location":2,"content":"kind of, like when we saw DrQA a couple of weeks ago."},{"from":4520.74,"to":4523.98,"location":2,"content":"If what we're doing is the NER task,"},{"from":4523.98,"to":4526.89,"location":2,"content":"our final prediction layer will be predicting"},{"from":4526.89,"to":4533.9,"location":2,"content":"the net- named entity recognition class of each token just like a standard NER system."},{"from":4533.9,"to":4542.77,"location":2,"content":"Okay, um, and so they built this system and tested it on a whole bunch of data sets."},{"from":4542.77,"to":4545.61,"location":2,"content":"Um, one of the main things they tested on was"},{"from":4545.61,"to":4548.63,"location":2,"content":"this GLUE data set which has a whole bunch of tasks."},{"from":4548.63,"to":4550.17,"location":2,"content":"A lot of the tasks, they're,"},{"from":4550.17,"to":4553.53,"location":2,"content":"uh, natural language inference tasks."},{"from":4553.53,"to":4557.2,"location":2,"content":"And I've kept saying that phrase all of this lecture but I haven't really defined it."},{"from":4557.2,"to":4560.82,"location":2,"content":"So, with a natural language inference you're given two sentences"},{"from":4560.82,"to":4565.94,"location":2,"content":"like: Hills and mountains are especially sanctified in Jainism."},{"from":4565.94,"to":4569.55,"location":2,"content":"And then you can write a hypothesis on: Jainism hates nature."},{"from":4569.55,"to":4571.53,"location":2,"content":"And what you're meant to say is,"},{"from":4571.53,"to":4573.57,"location":2,"content":"whether the hypothesis, um,"},{"from":4573.57,"to":4575.51,"location":2,"content":"follows from the premise,"},{"from":4575.51,"to":4579.24,"location":2,"content":"contradicts the premise, or has no relation to the premise."},{"from":4579.24,"to":4581.27,"location":2,"content":"So, that's a three-way classification."},{"from":4581.27,"to":4583.85,"location":2,"content":"And so here it contradicts the premise."},{"from":4583.85,"to":4590.11,"location":2,"content":"Um, there are various other tasks such as this linguistic acceptability task."},{"from":4590.11,"to":4593.55,"location":2,"content":"Um, but if we look at these, um, GLUE tasks."},{"from":4593.55,"to":4597.73,"location":2,"content":"Um, these are showing the Pre-OpenAI State Of The Art."},{"from":4597.73,"to":4600.73,"location":2,"content":"How well, um, ELMo works."},{"from":4600.73,"to":4603.9,"location":2,"content":"How well OpenAI GPT works,"},{"from":4603.9,"to":4608.41,"location":2,"content":"and then how well do small and large BERT models work."},{"from":4608.41,"to":4613.29,"location":2,"content":"And effectively, what you're finding is,"},{"from":4613.29,"to":4617.37,"location":2,"content":"um, that the OpenAI GPT was so,"},{"from":4617.37,"to":4618.49,"location":2,"content":"you know, pretty good."},{"from":4618.49,"to":4622.45,"location":2,"content":"It showed actually good advances on most of these tasks."},{"from":4622.45,"to":4625.89,"location":2,"content":"For many, but not all of them that broke the previous state of the art,"},{"from":4625.89,"to":4628.99,"location":2,"content":"showing the power of these contextual language models."},{"from":4628.99,"to":4635.2,"location":2,"content":"But the bidirectional form of BERT's prediction just seemed much better again."},{"from":4635.2,"to":4639.18,"location":2,"content":"So, going from this line to this line you're getting depending on"},{"from":4639.18,"to":4643.19,"location":2,"content":"the task about two percent better performance."},{"from":4643.19,"to":4647.01,"location":2,"content":"And so the BERT people actually did their experiments carefully."},{"from":4647.01,"to":4650.43,"location":2,"content":"So, these models are pretty comparable in terms of size,"},{"from":4650.43,"to":4653.77,"location":2,"content":"but the bidirectional context seems to really help."},{"from":4653.77,"to":4655.47,"location":2,"content":"And then what they found was,"},{"from":4655.47,"to":4657.57,"location":2,"content":"well, by going to just a bigger model,"},{"from":4657.57,"to":4661.55,"location":2,"content":"again, you could get another big lift in performance."},{"from":4661.55,"to":4664.74,"location":2,"content":"And so you're getting for many of the tasks about"},{"from":4664.74,"to":4668.15,"location":2,"content":"another two percent lift in performance going into the bigger model."},{"from":4668.15,"to":4671.01,"location":2,"content":"So, this really produced super-strong results."},{"from":4671.01,"to":4674.09,"location":2,"content":"And in general, um, people have found,"},{"from":4674.09,"to":4677.4,"location":2,"content":"um, that BERT continues to give super strong results."},{"from":4677.4,"to":4681.48,"location":2,"content":"So, if I return back to my ConLL NER task,"},{"from":4681.48,"to":4685.26,"location":2,"content":"we had ELMo giving you 92.2,"},{"from":4685.26,"to":4686.64,"location":2,"content":"um, and you, sort of,"},{"from":4686.64,"to":4688.05,"location":2,"content":"continue to get gains."},{"from":4688.05,"to":4693.9,"location":2,"content":"So, BERT Base gets you to 92.4 and BERT Large takes you to 92.8."},{"from":4693.9,"to":4697.65,"location":2,"content":"Though in, um, truth in, truth in description,"},{"from":4697.65,"to":4703.13,"location":2,"content":"there is now a system of beats BERT Large on NER which is actually a character-level,"},{"from":4703.13,"to":4705.99,"location":2,"content":"um, transformer language model from Flair."},{"from":4705.99,"to":4707.84,"location":2,"content":"Um, but, you know,"},{"from":4707.84,"to":4710.79,"location":2,"content":"this continued over to a lot of other things."},{"from":4710.79,"to":4713.86,"location":2,"content":"So, on SQuAD 1.1, um,"},{"from":4713.86,"to":4716.37,"location":2,"content":"BERT immediately just outperformed"},{"from":4716.37,"to":4719.74,"location":2,"content":"everything else that people have been working on for SQuAD for ages."},{"from":4719.74,"to":4722.61,"location":2,"content":"In particular, what was especially dramatic, um,"},{"from":4722.61,"to":4725.98,"location":2,"content":"was the sing- a single BERT model, um,"},{"from":4725.98,"to":4730.77,"location":2,"content":"beat everything else that had been done previously on SQuAD version 1.1,"},{"from":4730.77,"to":4733.57,"location":2,"content":"even though they could also show that an"},{"from":4733.57,"to":4739.81,"location":2,"content":"ensemble of BERT models could give further good, um, performance gains."},{"from":4739.81,"to":4743.06,"location":2,"content":"Um, and as I've mentioned before,"},{"from":4743.06,"to":4745.98,"location":2,"content":"essentially if you look at the SQuAD 2.0, um,"},{"from":4745.98,"to":4748.94,"location":2,"content":"leaderboard, all of the top ranked systems,"},{"from":4748.94,"to":4752.28,"location":2,"content":"um, are using BERT one place or another."},{"from":4752.28,"to":4754.59,"location":2,"content":"Um, and so that,"},{"from":4754.59,"to":4756.06,"location":2,"content":"sort of, led into this,"},{"from":4756.06,"to":4759.57,"location":2,"content":"sort of, new world order, um, that, okay,"},{"from":4759.57,"to":4762.73,"location":2,"content":"it seems like the state of NLP now is to,"},{"from":4762.73,"to":4765.24,"location":2,"content":"if you want to have the best performance,"},{"from":4765.24,"to":4766.41,"location":2,"content":"you want to be using"},{"from":4766.41,"to":4771.85,"location":2,"content":"these deep pre-trained transformer stacks to get the best performance."},{"from":4771.85,"to":4773.22,"location":2,"content":"And so this is, sort of, making,"},{"from":4773.22,"to":4775.41,"location":2,"content":"um, NLP more like vision."},{"from":4775.41,"to":4778.56,"location":2,"content":"Because really vision for five years has had"},{"from":4778.56,"to":4782.73,"location":2,"content":"these deep pre-trained neural network stacks, um, like ResNets."},{"from":4782.73,"to":4787.12,"location":2,"content":"Where for most vision tasks what you do is you take a pre-trained ResNet,"},{"from":4787.12,"to":4789.87,"location":2,"content":"and then you fine tune a layer at the top to"},{"from":4789.87,"to":4792.87,"location":2,"content":"do some classification tasks you're interested in."},{"from":4792.87,"to":4794.97,"location":2,"content":"And this is, sort of, now, um,"},{"from":4794.97,"to":4797.52,"location":2,"content":"starting to be what's happening in NLP as well."},{"from":4797.52,"to":4800.28,"location":2,"content":"That you can do the same thing by downloading"},{"from":4800.28,"to":4805.88,"location":2,"content":"your pre-trained BERT and fine tuning it to do some particular performance task."},{"from":4805.88,"to":4809.4,"location":2,"content":"Okay, um, that's it for today and more on"},{"from":4809.4,"to":4818.33,"location":2,"content":"transformers on Thursday [NOISE]."}]}