Placeholder Image

Subtitles section Play video

  • Katrin Erk>> It's so incredibly to process language which is ridiculous

  • because little children can speak. Little kids learn language

  • by being in the middle of the world. So, they have all these

  • visions, sound and they have language to go along with that.

  • The way that computers learn language nowadays is

  • more like if you sat a baby down with

  • a huge pile of The New York Times and said, "OK, here you go, now learn language."

  • One problem with human language is that it's horribly

  • ambiguous. Take, for example, the word "run." You can

  • run a race, run a company or you

  • run your car into a bog. Run-ing, running.

  • Traditionally what people have done is try to get the computer to

  • pick up on patterns using a dictionary. So,

  • here is this clear list of senses, so, run has say, 20

  • and here's the list. And then,

  • if you had one occurrence like "he ran the company" you would have to say,

  • "OK, that is sense #12 and not sense #11 and not sense #13."

  • Trouble is, all of those meanings are somewhat related

  • but, they're still different because you draw different conclusions.

  • Now, what's the poor computer to do. So, I'm thinking

  • maybe we can't distinguish senses as clearly say, here's where one

  • sense begins, here's where the next sense stops. What I'm doing

  • is to represent each time you use a word like run

  • with the context in which it appears. So, all of this context

  • you can put into numbers and then present such a context as a point

  • in a high dimensional space. And in order to represent

  • words as these points in high dimensional space I need a whole

  • lot of data. I need to have all that context and I need to have it in a form

  • that I can count it, which means, 100 million words is good,

  • a billion is better, 2 billion or 3 billion words, yeah,

  • then you can actually get decent models. But if you want to

  • compute with that amount of data and you do this on a single desktop machine

  • then you better be prepared to wait for a long time and I did that for a while.

  • I'd start an experiment, wait 3 weeks to see how it got out,

  • that's painful, really painful.

  • With TACC I can do the same things in a couple hours. So, we need super-computing because

  • all of natural language processing these days but, in particular when you want to do

  • stuff with word meanings you need to use a lot of data and I mean a lot of data.

Katrin Erk>> It's so incredibly to process language which is ridiculous

Subtitles and vocabulary

Click the word to look it up Click the word to find further inforamtion about it