Kathleen McKeown is the Henry and Gertrude Rothschild Professor of Computer Science at Columbia University and the founding director of the school’s Data Science Institute. McKeown received a PhD in computer science from the University of Pennsylvania in 1982 and has been at Columbia ever since. Her research interests include text summarization, natural-language generation, multimedia explanation, question answering, and multilingual applications.
McKeown has received many honors and distinctions throughout her career, including being named an AAAI Fellow, an ACM Fellow, and one of the founding Fellows of the Association for Computational Linguistics (ACL). Early in her career she received a National Science Foundation Presidential Young Investigator Award; in 2010 she received the Anita Borg Women of Vision Award in Innovation for her work on text summarization; and in 2019 she was elected to the American Academy of Arts and Sciences.
McKeown is also an Amazon Scholar, part of an expanding group of academics who work on large-scale technical challenges for Amazon while continuing to teach and conduct research at their universities. In early July, she will deliver a keynote at ACL 2020, the annual conference of the Association for Computational Linguistics.
We recently spoke with McKeown about the field of natural-language processing, her career, and her keynote topic for ACL 2020, “Rewriting the Past: Assessing the Field through the Lens of Language Generation.”
What drew you to the field of natural-language processing?
My undergraduate major was in comparative literature. I also majored in math, so I had both of those interests. But it wasn't until my senior year as an undergraduate that I learned about computer science and the field of computational linguistics. What got me interested in computational linguistics was that I could bring my two interests together, so I applied to graduate school in computer science.
How did you come to join the Amazon Scholars program?
I was on a sabbatical and someone I knew at Amazon asked me if I’d be interested in working there. And I thought, ‘Well, that would be fun to do on my sabbatical.’ It took a while to happen; I was well into the second half of the sabbatical when it did.
But I’ve continued doing it, one or two days a week. I like the work; the industry perspective helps with my academic research. And working at Amazon is a lot like working at Columbia: there are a lot of young people, and they’re very bright. Plus, the tools we use at Amazon for setting up a problem and debugging it give me some insight into what I should have my students looking at.
How has the field evolved during your time working in it?
The ACL 2020 conference has a theme of “Taking stock of where we are and where we are going with natural-language processing.” Before neural nets [computer systems modeled after the human brain], people were using statistical methods, machine learning, and discrete methods. Then in 2014 there were some startling advances in natural-language processing using neural networks, mostly in machine translation. In the two or three years after that, the whole field shifted.
My ACL talk will focus on language generation and summarization. Neural networks have had a huge impact in those areas; language generation has been transformed. Today we can generate language from a lot of unstructured data, and we’re seeing some very creative work. At Columbia, we’ve been working on argument generation, in the context of generating counterarguments. How do you do that? How do you generate text that is persuasive?
One of the powerful tools being used right now is BERT [Bidirectional Encoder Representations from Transformers], which came out of Google in 2018. BERT has a pretty good idea of how a sentence fits together grammatically, and through fine-tuning it enables learning from smaller data sets than was possible before.
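The interview names no tooling, but the fine-tuning idea can be sketched in a few lines. Below is a minimal, hypothetical example assuming the Hugging Face transformers library, with a toy two-example dataset standing in for a real labeled set:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
import torch

# Start from pretrained BERT; fine-tuning adapts it to a small labeled task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A toy dataset, purely illustrative; real tasks use a few thousand
# examples rather than the billions of words pretraining required.
texts = ["The summary captures the chapter well.",
         "The summary misses the main plot points."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetune", num_train_epochs=1),
    train_dataset=ToyDataset(),
)
trainer.train()  # pretrained grammar and semantics do most of the heavy lifting
```

Because the pretrained model already encodes so much about how sentences fit together, even modest labeled sets can adapt it usefully, which is what McKeown means by learning from smaller data sets.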
What’s the current state of natural-language processing?
One of the problems with current approaches is that people grab onto a data set that is available, then work on that data set to get a result, whether or not it solves a problem that needs to be solved. For some time now, people have focused on using natural-language processing to summarize news stories. That’s not something we really need; the story’s lead is often a very good summary.
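The observation about the lead corresponds to what summarization research calls the lead-n baseline: simply take an article’s first few sentences as its summary. A minimal sketch (the crude period-based sentence split is illustrative only; a real system would use a proper sentence tokenizer):

```python
def lead_summary(article: str, n_sentences: int = 3) -> str:
    """Lead-n baseline: return the article's first n sentences as its summary."""
    # Crude split on periods, just for illustration.
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return ". ".join(sentences[:n_sentences]) + "."
```

On news data this baseline is notoriously hard for learned systems to beat, which is part of McKeown’s point about news summarization being a less compelling research target.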
I believe we should be moving on to harder, novel problems. One is the summarization of chapters in novels, where we’ve used as a data set chapters of books taken from Project Gutenberg.
We’ve been doing this in our work at Amazon, where we are developing a system to generate summaries of chapters, using as training data the chapters from Project Gutenberg and summaries from online study guides.
That is a hard problem, and a very interesting one, because the study guides paraphrase the novel chapters heavily, and teaching a computer to recognize what is and is not a paraphrase is really hard.
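One common way to approach paraphrase detection, offered here as a hedged illustration rather than a description of the Amazon system, is to compare sentence embeddings. The library, model name, and example sentences below are all assumptions for the sketch:

```python
from sentence_transformers import SentenceTransformer, util

# Embed both sentences and compare; high cosine similarity suggests the
# study-guide sentence paraphrases the chapter sentence.
model = SentenceTransformer("all-MiniLM-L6-v2")

chapter_sentence = "She refused his offer of marriage outright."        # novel's style
summary_sentence = "She turns down his proposal without hesitation."    # study-guide style

embeddings = model.encode([chapter_sentence, summary_sentence])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity = {similarity:.2f}")
```

The stylistic gap McKeown describes, nineteenth-century prose versus modern study-guide language, is exactly what makes this kind of matching unreliable in practice.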
How will the field change over the next five or 10 years?
That’s a hard question. Just five years ago we couldn’t generate language from unstructured data like images or video, so the field is moving quickly.
One of the things I’m collaborating on is how we can take meeting recordings and generate summaries: action items and things like that. And I’d like to do more with the summarization of novels; I love that work. Summaries are often written in everyday language, while the books themselves have a completely different style from a very different time, so matching everyday language with the language in a book is difficult.
Overall, I think we’ll see big improvements in three areas. One is machine translation: we live in a global world, and there is a huge need to be able to understand documents in other languages. The second is conversational systems. I would love it if we could develop systems that could be true companions; think of how beneficial that could be to the elderly who are isolated because of COVID-19.
And the third is how we interact with online information. There is just so much on the web, so better ways to get good answers when asking questions, and the ability to summarize content and then drill down into it, will be extremely important.
We as computer scientists and people who work with language need to think about how we can help. Look at the COVID-19 pandemic: natural-language processing might help us better track the evolution of a disaster.
One last thing. ACL 2020 will be virtual this year. Is that difficult to work with?
[Laughs] In some ways, being remote makes things easier. I won’t see a big audience in front of me — so I’ll be less nervous!