Pedro Rodriguez, Arthur Spirling, and Brandon Stewart speaking on Embedding Regression: Models for Context-Specific Description and Inference.
The Hoover Institution hosts a seminar series on Using Text as Data in Policy Analysis, co-organized by Steven J. Davis and Justin Grimmer. These seminars will feature applications of natural language processing, structured human readings, and machine learning methods to text as data to examine policy issues in economics, history, national security, political science, and other fields.
Our 17th meeting features a conversation with Pedro Rodriguez, Arthur Spirling, and Brandon Stewart on Embedding Regression: Models for Context-Specific Description and Inference on Tuesday, March 14, 2023 from 9:00AM – 10:30AM PT.
>> Justin Grimmer: Everybody, welcome to the Hoover Institution workshop on using text as data in policy analysis. This workshop features applications of natural language processing, structured human readings, and machine learning methods to text as data, to examine policy issues across a wide range of fields: economics, history, national security, political science, and many other social science fields.
I'm Justin Grimmer. Steve Davis and I co-organize the workshop. Today, we're thrilled to have Arthur Spirling, who's professor of politics and data science at NYU. He's presenting a paper co-authored with Pedro Rodriguez and Brandon Stewart. It's entitled Embedding Regression, Models for Context-Specific Description and Inference. Some quick rules before I turn it over to Arthur.
Arthur's gonna speak for about 30 to 40 minutes. If you have questions, please place them in the Q&A feature. Steve and I might interject if there are pressing questions. And Arthur or Brandon, who's here as well, one of his co-authors, might respond to clarify questions in the Q&A.
Or Steve and I might recognize you and ask you to ask your question live. After about an hour, we're gonna turn the recording feature off and we'll turn it over to a more informal Q&A session. And there, you can ask more nuts and bolts style questions, as well as drilling deeper on any remaining questions that you have.
So with that, Arthur, take it away.
>> Arthur Spirling: Thank you, Justin, and good morning, everyone. Pleasure to have so many people online and look forward to a fruitful discussion today. So I'm Arthur Spirling. I'm presenting this paper that's called Embedding Regression, and my co-authors are Pedro and Brandon. I'm also gonna present some extensions of the work with Elisa Wirsching today, but that will be later in the talk.
So I want to start out with some images of political elites having debates or having meetings and thinking about meaning. And that's, in a way, a sort of strange, abstract thing to think about. But for example, right, we think about Republican and Democratic elites, or indeed Republican and Democratic voters.
And what do they understand by the term immigration? What do they mean by that term when they think about what an immigrant is, or when they're discussing immigration with each other, as Biden and DeSantis were at one point? What do they mean by that? Do they mean people who are potentially undocumented?
Do they mean people who come as students, become green card holders, become citizens? What is their understanding of that term? If we think about the historical record in the 20th century between the UK and the US, what do elites in the UK and the US understand by the term empire in the post-war period?
Here's a couple of elites: we've got Macmillan here, prime minister in the UK when it was going through decolonization, particularly of Africa, and we have Kennedy here, right? And they had many discussions about what the role, basically the reduced role, of the UK would be in the sort of post-war order.
But what did they understand when they talked about empire? And then finally, maybe a more recent example, which is connected to sentiment, that is to say, affect: how we feel about things in terms of how we understand them. What is the meaning for, say, Conservative backbenchers in the UK, who pushed for Brexit for essentially all my lifetime?
What do they infer when they hear a term like the EU? How can we think about what that means to them? How would we measure that? So at a broad level, those are the types of problems we're gonna be dealing with today. So what do we mean by mean?
Well, if you spend any time in linguistics, you will see this quote. You probably won't read the underlying book collection from which it's from. It's a quote by Firth in 57. It says, you shall know a word by the company it keeps. And this and similar sentiments have become known as the distributional hypothesis.
So this is the idea that if we see two words in the same context, and I'll be quite specific about context in a minute. But for now, you can think of it as with the same words around them. If you see two words that appear in the same context, we might infer that they mean the same thing.
That's not the same as knowing their meaning. It means more that we think that these two terms probably mean the same thing as each other. That's the nature of it. Nonetheless, right, that should help us make inferences across context. So if the context differs for a word, that might tell us that these terms mean something different.
And just to make this idea firm, suppose that we observe that Republicans, elites, voters, respondents, whatever it might be, they tend to use the word illegal when they're talking about immigration. As in the term illegal is close to the word immigration in their speeches. But we also observe Democrats don't do this, right?
Then we might infer that there's something different that's meant by immigration by these two different groups. And we're gonna try to make that very firm and specific in what follows. So what we're doing here is assuming some sort of bare bones structuralist account of language. So meanings are different if distributions are different.
So the problem is to work out if those distributions are different. So this isn't a sort of deep understanding of language, where we're trying to think about what's going on in terms of the chemical reactions in someone's head, in their brain, or something, right? We're gonna compare some distributions.
We're gonna have some machinery for doing that, and that's going to be how we think about it. The way we're gonna do this is with word embeddings. If you are not familiar with word embeddings, I'm gonna give you a very brief crash course. So what are they? Literally, they are real valued vectors of numbers.
So if you ask for an embedding of some word, like cat in English or chat in French or something like this, right, I would give you back some length, say, 100 or 300 set of numbers. And where do they come from? Well, typically, they come from some sort of predictive problem.
So we're interested maybe in some elaborate neural network of predicting the probability of the next word, given the word that we've just seen. And of course, we can think of industrial examples of this. So for example, on your phone, if you have predictive texting or something, that's an attempt to do this problem in real time very, very quickly.
But things like ChatGPT are trying to do the same thing, too. They're trying to predict the next word, given what was written before. That's more complicated in that case, because they're predicting lots of words, but that's the basic idea. So here's the idea, right? We might be interested in what's the probability of seeing the word immigration, given we observed the word illegal.
And that might be high for one group and low for another. And so you'll see pictures like this of neural networks. This is a very famous particular case called word2vec. And the idea is that there is a weight matrix on the left-hand side, and our embeddings, which are these vector representations of these words, come from those weights.
They come from this attempt to fit these models, where we're trying to predict the next word, given the word that came before. So again, just to fix ideas, I want to think about a sentence like this: I had a cup of coffee this morning. We're looking at some English corpus, and we observe this sentence, right?
And we'd like to know perhaps the relationship between coffee and cup, and coffee and tea, and cup and tea, or something like that. And so here's a context, a two-word context, that would be small for most kind of real-life industrial applications. But still, this is a context around the word like coffee.
And if we observe that enough, right, we may infer via these weights, via these embeddings, that cup and coffee are in some sense close together. So in this case, they have this kind of close physical location in the sentence, but we may also infer that their embeddings lie in a similar place, not far from each other.
And so we infer these must have something to do with each other. Now, what if I continue reading this English corpus and I see, you know, I had a cup of tea this afternoon. So this may help us infer that tea and cup are close together. And in fact, even if we never see coffee and tea in the same context, in the same sentence near a cup, we might infer that they're somehow related.
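Throughout, "close" is usually made precise with cosine similarity between embedding vectors. A minimal sketch in R, using made-up three-dimensional toy vectors rather than real embeddings:

```r
# Cosine similarity between two embedding vectors (toy values for illustration;
# real GloVe-style vectors would have 100-300 dimensions).
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

coffee <- c(0.9, 0.1, 0.3)
tea    <- c(0.8, 0.2, 0.4)
cup    <- c(0.5, 0.6, 0.2)

cosine_sim(coffee, tea)  # high: the two drinks show up in similar contexts
cosine_sim(coffee, cup)  # also fairly high: they co-occur, "a cup of coffee"
```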
In this case, they're both drinks. Cup is not a drink, right, but it's something that we use for drinks. And so we infer that these words are close in some space. So to be clear, these word embedding approaches are popular. I mean, that's something of an understatement, I think.
So the original two big architectures that you'll see, well, in academia and other places, are something called word2vec and GloVe, the latter of which was designed at Stanford. And these have hundreds of thousands of citations by now. So this is 100,000-plus citations across just three papers.
And this is sort of under counts it really, because these are used in industry all the time where they may not be cited. So they're very, very popular ideas, basically, and we're seeing them in social science, too. So you will see papers already about things like can we think about social class through embeddings?
Can we think about how the media reports on different subjects and tries to think about them through an equality lens? How has that changed over time? People have studied how the ideological arrangements in parliaments have changed, based on embeddings.
So these are out there and people are using them and they're very popular. And this, in a way, is our sort of jumping-off point to get into our paper. So hopefully, you see that embeddings are good, they're very widely used, and I've given a bit of a crash course in terms of what they mean.
But in practice, when we think about what social scientists want on an everyday basis, they want a bit more. So it's not just that they want embeddings of some political terms, though certainly that's helpful. They want to, in fact, quantify differences in meanings as those meanings occur in, say, different groups.
And they'd also like to talk about associated uncertainty. So they want to talk about, is there a statistically significant difference between Republicans and Democrats on this subject as regards this embedding, as regards their understanding of a particular term? So we call that general demand embedding regression. And I hope that we have supplied a solution to it.
But when we use the term embedding regression, we mean resolving this issue. And we have in mind, again, just to fix ideas, something like multiple regression. So you're gonna have embeddings on the left-hand side and I'll go into some detail about how and why that's a trickier problem than it first appears.
And then we're gonna have some group memberships on the right-hand side. So I'm gonna be able to talk about the effects of belonging to some particular group in terms of its embedding outcome. So here, for example, X1 is the group membership, although in fact these don't need to be binary, and they could be something more interesting.
So I'm gonna set up the problem first in a sort of small way, where we'll show that the easy part of the problem is technically solvable, and then we're gonna lead to a more difficult part. So obviously, just to be clear, embeddings are not scalars, right?
So they're not a single number. So an embedding i has some long vector representation. So we need a little bit of machinery to sort of take care of that. We can't just stuff everything into a multiple regression and off we go. So what we do in the paper is we set this up as something called a multivariate regression.
And as I say, this is the kind of easy part of the problem. So these regressions are maybe well known in psychology, where you have a stimulus and you're trying to understand how it affects multiple outcomes. So I believe in that discipline, they measure things like blood pressure and heart rate and the conductivity of the skin and so on and so forth.
You can use multiple measures and you want to see how they are affected by some X changing. But just to fix ideas, right? Notice that Y for us is going to be a matrix. We're gonna have n of these embeddings, and they're each gonna be of dimension D, okay.
So that is just, so you see what an embedding looks like. It looks something like this. So this is a yi, this is some embedding of some word, and that is what our dependent variable matrix looks like. We wanna have that on the left-hand side and we wanna have group memberships on the right-hand side, okay.
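To make the layout concrete, here is a toy sketch in R of what that multivariate regression looks like, with simulated numbers standing in for the ALC embeddings:

```r
# Simulated stand-in for the real data: n instances of a focal word, each with
# a D-dimensional ALC embedding stacked as a row of the outcome matrix Y.
set.seed(1)
n <- 200   # instances of, say, "immigration"
D <- 50    # embedding dimension (300 in the GloVe examples)
republican <- rbinom(n, 1, 0.5)                  # group membership indicator
Y <- matrix(rnorm(n * D), nrow = n, ncol = D)    # rows play the role of ALC embeddings

# Base R's lm() accepts a matrix response: one least-squares fit per embedding
# dimension, all sharing the same right-hand side.
fit <- lm(Y ~ republican)
dim(coef(fit))   # 2 x D: one row for beta_0, one row for beta_1
```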
But, in a way, that said, that's the straightforward part. The hard part is this part here. So if we wanna set the problem up that way, I need single embeddings each time, right? So what I'm saying is every time someone, whether they be Republican or Democrat, uses a term, I would like to have an embedding for them of that term's use.
It is straightforward, and you could do it in two minutes to go and find out what is the embedding for the word immigration in the discourse. You just download something from the Stanford website, look up what it is. That's the embedding, right? That's not what we want. We want an embedding for every time that term is used, right?
So if it's used three times in a speech, I want three embeddings on the left-hand side. And that's a hard problem. It's a challenging problem. So we turn to some earlier results, in this case, from Aurora. So it says that if you have a word, this is the nature of this result.
It says if you have a word, you could get an embedding for it by merely taking the mean of the embeddings of the words around it. So generally, we think that those embeddings we could get from something like this glove database, this big glove matrix, we download in two minutes.
And I would just look around the word that I care about, take the average, and that should give me a good representation of that particular instantiation in that particular speech of that word in some embedding space. So here's an example from someone's memoirs, right? The debate lasted hours, but finally, we voted on the bill, and it passed with a large majority.
But I've got the context of the word bill, and I want an embedding for that term, not bill in general in the discourse, but bill here in this memoir, as written by this politician. So what should work, almost works, not quite, is taking the mean of the pre-trained embeddings of the words around it, the "we voted on the" and the "and it passed."
The problem, though, is that we need to downweight the directions of function words. So in that previous slide, we had words like it, for, or whatever it was, right? And that's not very helpful, and it tends to overwhelm our embeddings. There's lots of these function words, and we end up kind of with a very noisy embedding that's not really reflective of how people are using bill in that particular sentence.
So we turn to some previous work that Brandon and colleagues did, which suggests multiplying this mean by a transformation matrix A. This is the formula for obtaining A; that's not new to our paper, but we do apply it in our paper. It turns out this can be learned relatively easily from a large corpus, and you only need to do it once.
So once you put together this A, which is D by D. So if your embeddings were of dimension 300, you're gonna have a 300 by 300 matrix. And then if you want a new embedding, that is to say, I want a new embedding of my special word, bill, as used by this politician in this memoir, right?
I take the average of the embeddings of the words around it, as used there, and I multiply by this A, and I get out some nice, very precise, what is called an a la carte embedding. So in what follows, sometimes I'll say ALC, and I'm referring to obtaining this A, multiplying it by those averages, and the embeddings that we get from that.
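Concretely, a minimal sketch of that a la carte step for a single instance of a word, assuming you already have a matrix of pre-trained embeddings (here called glove, with words as row names) and the D-by-D matrix A; the object and function names are illustrative, not the conText package's:

```r
# A la carte (ALC) embedding for one instance of a word: average the pre-trained
# embeddings of the words in its context window, then transform by A.
alc_embed <- function(context_words, glove, A) {
  context_words <- context_words[context_words %in% rownames(glove)]
  u <- colMeans(glove[context_words, , drop = FALSE])  # mean of context embeddings
  as.vector(A %*% u)                                   # the ALC embedding
}

# Context around "bill" in the memoir sentence from the example:
ctx <- c("the", "debate", "lasted", "hours", "but", "finally", "we",
         "voted", "on", "the", "and", "it", "passed", "with", "a",
         "large", "majority")
# bill_alc <- alc_embed(ctx, glove, A)
```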
ALC is just a la carte. So let me just give some intuition about what's going on with this A matrix, what it looks like, and what it does for us. So what's going on over here in the sort of right part of this optimization? So vw is some original embedding from some large corpus.
So earlier, I spoke about things like GloVe, which is a resource that's been trained on various things like Wikipedia for modern English, right? Where we can go off and get a bunch of embeddings for various words. So I said that in principle, you could get an embedding for the word bill from GloVe, okay?
And then we've got this A multiplied by this mean. So where mean is the mean of the original embeddings from the big corpus, but now applied to our little focus corpus, for example, the memoirs of some politician. So what I'm saying intuitively is the following. Give me an A matrix which, when I multiply it by the averages of the embeddings around the word.
That's very specific to my person using this very specific term in his particular memoir, right? Gives me back embeddings that are sort of as close as possible, subject to this particular optimization, to the original embeddings, okay? So it's saying, look at these glove embeddings, right? Give me back an A matrix which sort of pushes that mean towards those original embeddings.
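And here is a rough sketch of how A could be learned under that reading of the objective, as a count-weighted least-squares problem: for every word in the big corpus, regress its original embedding on the average of the embeddings of the words around it. This is a gloss of the Arora/Khodak et al. idea, not the authors' exact code; the objects V, U, and wts are assumed to be precomputed.

```r
# V   : (vocabulary x D) matrix of original embeddings v_w (e.g., GloVe)
# U   : (vocabulary x D) matrix of context-average embeddings u_w for the same words
# wts : length-vocabulary vector of count-based weights (e.g., log counts)
learn_A <- function(U, V, wts) {
  # Weighted least squares, solved for all D columns at once:
  # A' = (U' W U)^{-1} U' W V, with W = diag(wts)
  t(solve(crossprod(U, wts * U), crossprod(U, wts * V)))
}
# A <- learn_A(U, V, wts)   # a D x D matrix, learned once and then reused
```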
So what's happening? Well, as I say, I don't just want to take the mean of some words around my kind of focus corpus, right? So let's say I'm interested in a word like rights, which I was actually interested in for a different project on some 17th century pamphlets, right?
I don't want it to become overwhelmed by the average of "the" and "of", right? That doesn't seem like the right idea. And the A matrix is in some sense preventing that from happening. The intuition is it's forcing that new embedding, that very, very specific embedding for my very, very specific use of this term, to be at least somewhat similar to, pulling it back towards, what we learned about rights from, say, Wikipedia, as in the embedding of rights, right?
And that stops it getting pulled away, stops it just reflecting function words. The other thing that we sometimes have, and there are different ways to do this, is a sort of count weighting. What's going on here is that, basically, this is reflecting the fact that we are more confident in some embeddings than we are in others.
So if you think about it, even in a large corpus, right? We're a bit more confident when, for words that we see all the time, we know a lot about them. And some less common words, we're sort of less confident in. And this helps us sort of reweight towards the more common words.
But it avoids us, again, because we typically use these log functions, it avoids us becoming overwhelmed by these function words. Okay, so why does this matter? So that was a disquisition about these technical details, right? So who would care about this, right? Why is this important for understanding meaning, at least as we first discussed?
If I can produce embeddings for any word, then I can produce them for any instance of any word. That's the idea. That means I can set up that matrix on the left hand side in that regression equation, and then I can go off and do my regressions. So does it work?
Well, let's take a look at, say, New York Times mentions of lowercase t and uppercase T Trump. And we're gonna provide ALC embeddings for everyone and see if they're different across, but similar within. So that is to say, just to clarify the problem here, right? Let's take a sentence like this.
Who is this talking about? This is talking about the named entity, Donald Trump, right? Capital T Trump, that's who it's referring to. And I would like my embedding for capital T Trump to sit near the other embeddings for capital T Trump. So I'm gonna try and get an embedding, an ALC embedding for this one instantiation of Trump and see if it works.
And just for visualization, we're gonna put it in two dimensions. But there are all sorts of ways that you could measure this. Okay, then we had another instance. Let's build an embedding just from this tiny little fragment. There it is. Okay, now we're finding a trump that's not this named entity, capital T, proper noun, Trump.
It's a reference to bridge or some other card game. We'd like to see that to not be too close to the other capital T Trumps. And we continue to do this, and we find that ALC is doing it. So what's going on here is we're saying every time it finds the word Trump and it builds this embedding, this really specific embedding locally to this one use of the term, right?
When we then look at that in two dimensions, they're sufficiently far apart that we believe these are different uses. That's great. So it's doing it off one particular use. They also make sense, and I won't belabor this too much. I'll just say that if you look at the right columns here, transformed, untransformed, right?
These are what we call good nearest neighbors, right? So these are things that we think kind of should be close to a word like capital T Trump. Capital T Trump sort of should be close, in some space, to Biden, right? Lowercase t trump should be close, in some space, to bidder, bids, spades, and things like bridge games.
The other thing that we do in the paper before we get into our applications is a replication of a sort of state of the art, which is Emma Rodman's paper in Political Analysis. And she looks at, in the New York Times, for example, she looks at, on the bottom here, bottom left, the distance between gender and equality, right?
So it's trying to understand how that distance has moved over time. Is gender talked about in terms of equality, or is it discussed in some other frame? So she has all sorts of very nice and carefully done hand-coding stuff. And then she looks at a particular model, this particular chronological model, which she recommends for end users.
And we replicate her analysis, and we have almost identical patterns. So the thing to notice from this graphic is that our ALC line basically moves as her line does for all these different subjects. But I want to emphasize the speed. So to fit the types of models that she's interested in, she's taking about 5 hours on this corpus, and we're gonna get done in about a second.
Something like that, right? So this is just very, very fast and pretty accurate, right? So we have this sense of, we feel happy about this, that it works. Okay, well, let's get back into this regression business. So just a slight abuse of notation here. Let's suppose that I have two groups, group R and group D, and we'll assume they're binary.
So you're either in group R or you're not, and therefore you're in group D. So here is the modeling problem. I want to have my embedding on the left hand side. I want to regress it on whether you're in group R or not. And then I want to draw some conclusions.
So to reiterate, each Yi is a one by d dimensional vector, which has come from this ALC instance of the word, like those Trump examples. Group R is just some indicator taking the value one if you're in the group, you could say in this case if you're Republican, but there's nothing special about that, and taking the value zero otherwise.
So let's think about what we get out here, right. If I fit that regression using this multivariate setup, what is beta zero? Right. Or my estimate of beta zero. Beta zero is going to be this. It's the ALC embedding for this group d. It's sort of an average of everyone who's in group D, because Republican mechanically will take the value zero.
And so beta one times zero. So we'll be left with beta zero, which will be the average embedding for this group D. What about beta zero plus beta one? This will be the average embedding for group R. Right. We're just adding two things, right? And we're going to end up with an average embedding for Republicans.
We have some various ways in the paper to think about the magnitude, and we take norms and things. There isn't really a natural way of interpreting the beta zero or beta zero plus one, right. But you do get two sort of average embeddings, in fact. Right? One thing that previous discussants have said on this paper is, this seems like a machine, fast machine, maybe an accurate machine for producing average embeddings for groups.
And we would say, yes, that's what it does. And that's why it's a helpful thing to do. In the paper, we have various ideas. We show how you can quantify uncertainty here. We use some sort of ideas and tricks from previous papers so we can talk about something being statistically significantly different.
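Putting the pieces together, here is a toy end-to-end sketch of that logic in R, with simulated data again; the group-label permutation test for the size of the group coefficient is one simple version of the inference idea described next, and the authors' conText package implements the real thing.

```r
set.seed(2)
n <- 200; D <- 50
republican <- rbinom(n, 1, 0.5)
Y <- matrix(rnorm(n * D), nrow = n, ncol = D)               # stand-in ALC embeddings
Y[republican == 1, 1:5] <- Y[republican == 1, 1:5] + 0.5    # build in a group difference

B <- coef(lm(Y ~ republican))
beta0      <- B[1, ]               # average embedding for the Democrat group
beta1      <- B[2, ]
groupR_avg <- beta0 + beta1        # average embedding for the Republican group
obs_norm   <- sqrt(sum(beta1^2))   # magnitude of the group difference

# Permutation test: shuffle the group labels, refit, compare coefficient norms.
perm_norms <- replicate(1000, {
  b1 <- coef(lm(Y ~ sample(republican)))[2, ]
  sqrt(sum(b1^2))
})
mean(perm_norms >= obs_norm)       # approximate p-value for the difference
```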
Basically, it's a permutation of the group labels in y, and there are various ways that you can bootstrap things as well. So let me work through some applications from the paper. So, firstly, one of the things that we wanted to check, and I think this is in the congressional record, is this:
What terms do Republicans and Democrats, these two groups, have just different understandings of? And again, if you want to interpret that as different uses of those terms, that's fine, right. So the big ones, the ones with the largest differences, are immigration, marriage, and abortion. And the ones with the smallest differences are these function words.
So they sort of have these same understandings of these function words. You might be thinking, well, wouldn't we want to have continuous covariates? Wouldn't I want to regress it on something like a NOMINATE score? And you can do that. That's no problem, right. So this is the distance between these various terms, which I've written on the right-hand side of the plot, and immigration, right.
As we move from the leftmost House members to the rightmost ones. And so we see, for example, that over here on the left, so these are mostly Democrats or liberal Democrats, right, a word like bipartisan is quite close to the word immigration. And then as we move across, Republicans to the right don't use that term when they're thinking about immigration.
A word like illegals is not really used very much on the left-hand side of the chamber, again, in connection with immigration specifically. But as we move to the right, it's used more and more. So this kind of makes sense. What about our case of the UK versus the US, right?
So what we compare here is two completely independent corpora, right? So we're looking at the US congressional record and the UK Hansard over approximately the same period, from just before the Second World War till almost the modern period, right. And we're looking at how they understand the term empire. And you see this kind of once-and-for-all decrease in cosine similarity, because they stopped talking about empire in the same way.
And we can be specific about that. So basically what happens is early on they're thinking about empire as being about something that is about sort of European colonies in the world, but certainly after Suez, right. So there's the Suez crisis. And this is the sort of somewhat explicit admission that the US now will be the superpower, the UK will no longer dominate world affairs.
And we see the UK continues to talk about empire in terms of its own territories and colonies, but the US has switched to thinking about empire as being about communism, Soviet Union, revolution and so on. So it's kind of moved how it talks about empire. And we can do things in the paper where we say, these are the terms that reflect the most American understandings of empire in the 1950s.
So for example, talking about soviet communists, whereas these are the other terms on the left hand side that are most British. So finally our final example, I want to talk about the EU. So one of the things that political scientists are very interested in, of course, is sort of sentiment towards things.
So what we did, we said, well, let's look at different subject areas, topics that arise in, say, the early 21st century, and we're looking particularly at Conservative backbenchers here. So the pattern that we observe, for, say, we've got education over here in this left panel and the NHS in the middle, and the European Union on the right-hand side, right.
There's this very particular pattern that we see in Westminster parliament. So it goes like this. When your party is not in power, you spend a lot of time saying how bad a given policy area is. And as soon as your party is in power running the show, you talk very glowingly about that policy area.
So education, education is terrible if I'm a Conservative and Labour is running it, and then as soon as my government is running it, I start to say it's great and I start talking about it in this very positive way. And we look at this for education, the NHS and the EU. The thing to notice is in this bottom right panel down here.
Is that the EU is this one particular area we found where even though the Conservatives get a conservative government, they never talk about it in a very sort of positive way. And they only start talking about it in a positive way when they get a referendum. It turns out this leads to the UK leaving the EU.
So it's just an example of something where we're able to show that using these embedding techniques, that they're never very positive about their own government's policy. Okay, so some concerns you may have. We can discuss these maybe more later, right. Is it sensitive to decisions that you make?
Well, it depends, right. In general, you don't need to stop the documents, that is, remove the function words, because the A matrix is going to take care of that, and you don't generally need to stem them. Do I need a local similar corpus to pre-train on? I mean, yes, you do, but we found that a resource like GloVe, which you can download in a few minutes from the website for free, will handle many, many kind of modern English examples.
Is there uncertainty in the estimation of A? Yes, but we wouldn't worry about it too much. We did a bunch of experiments on that. We can return to those. So one question we often get is, can this be expanded to something other than English, and the answer is yes.
And that's what we're working on at the moment. So if you have low data resources, you're working in a language like Upper Sorbian. I don't know much about Upper Sorbian, and I don't want to offend any Upper Sorbian speakers who may be on the chat, but I know it's not a very common language.
And it's hard to get large corpora to build up embeddings for that language. So, by the way, when people talk about this, English, Mandarin, these widely spoken languages, we have a lot of training data and it's very easy to build embeddings for them, but these other languages are tricky. So not only might you not have much data, you may have low computational resources, and it's hard.
You'd maybe have to fit these embedding models on your laptop or something. So we provide them, and we're gonna have a paper out soon about that. And we're able to show in that other paper that it works very well, so you get really good nearest neighbors. So, for example, if you look at something like French, right, these are good nearest neighbors, so these kind of make sense.
The French embeddings we're providing are sort of meaningful and will be helpful to researchers working in that language. So, sort of almost wrapping up, right, I just wanna make clear that we have some software that's already out there. It's being downloaded fairly furiously, mainly written by Pedro, it's on CRAN, and it has this particular lm-style, glm-style syntax.
And so what you can do here is assign it to a model, so you're saying, let's use this conText function. I want to regress immigration on Republican as a group membership, with some data, some pre-training and so on. And you can make various selections, and you can get out regression coefficients, okay?
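For orientation, the call looks roughly like the schematic below. The argument names are paraphrased from memory of the conText vignette and may not match the current release exactly, so treat this as a sketch and check the package documentation.

```r
# install.packages("conText")
library(conText)

# model <- conText(
#   formula          = immigration ~ party,  # embed "immigration", regress on group
#   data             = toks,                 # a tokenized corpus
#   pre_trained      = glove,                # matrix of pre-trained embeddings
#   transform        = TRUE,                 # apply the a la carte transformation
#   transform_matrix = A,                    # the learned D x D matrix
#   permute          = TRUE                  # permutation-based uncertainty
# )
```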
This is the main website that will take you to the paper, and it will also take you to the software. And then finally, my co-authors: Pedro Rodriguez is a research scientist at Meta. Brandon Stewart is a professor at Princeton, he's on the call, and Elisa Wirsching is a graduate student at NYU. Thank you, Justin.
>> Justin Grimmer: All right thank you, Arthur. I'm gonna start with a couple questions then, Steve, I assume you'll have some as well, and then we can open it up to the audience as well. Okay, so, I mean, this is just a completely fascinating paper, and I really enjoyed it.
One of the sort of big overarching questions I had was thinking about applying the a la carte method to different baseline embeddings. And you sort of touched on this at the end by discussing how you, of course, might want your own locally trained embeddings. Just sort of thinking about this going out in the world.
Suppose a researcher has two different embeddings, they do the a la carte on two different embeddings. They end up with regression coefficients of different magnitude or perhaps worst case, contrasting magnitudes, how would you think about adjudicating those results? And then that actually is gonna tie into a second question, which might be easier just to take now, but I can re ask it as well.
Sort of thinking about, you do this result, and you do an excellent job within the paper and the presentation of connecting it to sort of real world implications, I'm sort of thinking through for the audience or for me, I have this technology, I get a particular result. What are the steps that should be taken in order to say, okay, this is what this means, this is the sort of evidence I'd like to bring to bear to support the interpretation of this regression?
>> Arthur Spirling: Yeah, let me deal with the second one first, actually. So what we suggest in the paper, I don't know how explicit we are about these are the steps, but what we do in the paper is that we focus, as many other scholars do, on thinking about things like nearest neighbors.
So basically, I mean, your own work, you've written a lot about what it means to validate these things and the importance of validation. I think we would say, do the nearest neighbors make sense, does this embedding that you've got out for this group have the right relationship to other things that you have some reasonable prior about, right?
And so we would be, as usual with these things, nervous about telling people to go and use them if they don't have at least some, for want of a better word, contextual knowledge about the case that they're studying. And in fact, I mean, yeah, so even looking at these British MPs, there's kind of various terms of art that you need to understand in order to know whether these embeddings make sense.
Let me return to the first one about you have two different embeddings, and how would you assess between them? So I think Brandon will probably step in a second on this, but my feeling is that you would have to make an argument that the underlying embeddings, the large corpus embeddings, are somehow a good match for your particular substantive problem, and I give an example of this.
So in a separate paper, I've been working with a colleague at NYU, and we were looking at some 17th century documents. We don't have very many, and so it's hard to put embeddings in general in 17th century English documents. So we use some parliamentary documents, journals from England at the time, in the 17th century, and we think that's pretty close.
But, for example, something like GloVe wouldn't work, and I give specific examples. So a word like sovereign, right, in the 17th century in England is a reference purely to the king, that's what it means. But sovereign today in GloVe, we're talking about sovereign debt and defaulting and so on and so forth.
Now, those words are whatever etymologically connected, but it means something different, but I'll let Brandon step in on what would you do in that event.
>> Brandon: Yeah, I think this is great, thank you. So I guess what I would say is it sort of depends on how they conflict.
So in a setting where the magnitudes conflict, I think we're not too worried at all cuz the magnitudes don't really have a lot of meaning. And so we're trying to be quite clear about that, it's one of the reasons it makes sense to kind of benchmark against function words and things like that that we do in the paper.
I think the interesting question is when different words that are sort of substantively different come out on top. And I think the part that's subtle here is that what you're getting from the magnitudes of the distances between words is relative to the original embeddings in the corpus.
So you have to think about it in the context of whatever that original embedding was trained on. Now, we've not seen major differences between things like word2vec and GloVe or whatever, and I would be sort of surprised. If you do, I would get kind of nervous that, yeah, it's just sort of too noisy.
For whatever reason, the words you're looking at are too rare, the context are too strange. But if you had very context specific embeddings, the thing I would look to is is there a substantively different meaning of the focal words or the nearest neighbor words in the corpus that you're trying to study?
Which again, goes back to Arthur's point about substantive knowledge. The other thing I'll say is that one of the things that I think we like about this is that you can get pretty good nearest neighbors with even relatively few instances of the word. So, one of the things that Arthur showed was the replication of Emma Rodman's sort of really like path breaking work, she was at this years before anybody else was in social science.
But one of the things about that approach is when you look at the nearest neighbors for those words, they're all extremely noisy. Cause it turns out there's actually just not that many instances of the words she wants to compare. Which is fine, right, I mean, that's the nature of historical documents.
But what we're able to show is, even when our results are very similar to her validations, you get these very rich sets of nearest neighbors that give you a much clearer sense of what's going on. And that's partly because you're leveraging the pre existing embedding base.
>> Justin Grimmer: Steve, do you want to hop in?
>> Steve: Yeah, so, fascinating talk and paper. So, one thing that leaves me groping: the multivariate outcome vector seems a little unsatisfactory because, in a particular application, it might not lend itself readily to easy interpretation. So take your EU example, the way it's discussed by the UK parliament. Here's how I would have been inclined to do it, and I wanna get your reaction as to whether this is better or worse than what you did.
So I'd get the embedding vector and transform it using the A matrix exactly as you did, for each instance of talking about the EU. And then, I guess, I'd just look at a sentiment score. So I take the transformed vector and find the nearest neighbor, or some set of nearest neighbors by some criterion.
And then I just calculate some sentiment measure for each of those nearest neighbors and average them, okay? That gives me then a single number, essentially: what's the positivity or the negativity of the sentiment of the words that surround the usage of EU? That's my left-hand side variable; regress it on whatever you're interested in.
That strikes me as both simpler than what you actually did and easier to understand the outcome. So, what do I lose by doing that? I realize I'm building on a lot of what you actually said, but it just seems like I've tailored the outcome variable for the application of interest, and that strikes me as probably a better way to go in many respects.
You can make similar comments about your partisan differences regarding the use of word immigration. I think much of that is basically about, do you cast it in a positive or negative light? In your empire example, I think there's probably more than just that going on. It's not really positive versus negative, but the whole set of things that are invoked by the meaning empire, my guess is changing.
So that's probably gonna be harder to reduce to a one dimensional outcome without really doing serious violence to what's happening in the data. That's just kind of a reaction to the way you applied the methods, and I wanted to get your thoughts on that.
>> Arthur Spirling: Yes, for what it's worth, I mean, your description of the sentiments approach isn't far off what we do.
So the intuition is not wrong. So what we do is we say, okay, look, here's an embedding of the EU and let's see all these conservative MPs using it, right? How far is their average embedding from the embeddings of positive words versus the embeddings of negative words and take some averages and so on.
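A rough sketch of that kind of comparison, as one possible reading rather than the paper's exact procedure: take the group's average ALC embedding for "EU" and compare its cosine similarity to the centroid of some positive words versus some negative words. The word lists and object names here are made up for illustration.

```r
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# eu_tory : average ALC embedding of "EU" for Conservative backbenchers (assumed)
# glove   : pre-trained embedding matrix with words as row names (assumed)
pos_words <- c("excellent", "success", "welcome")
neg_words <- c("failure", "damaging", "crisis")
# pos_centre <- colMeans(glove[pos_words, ])
# neg_centre <- colMeans(glove[neg_words, ])
# cosine_sim(eu_tory, pos_centre) - cosine_sim(eu_tory, neg_centre)
# A positive value would read as a relatively more "positive" usage of the term.
```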
So the broader question in this literature, about what is just us having a differential understanding of this term versus us having different sentiment towards it, is a bit murky. And there is a biggish sort of sub-literature now on, kind of, what do we mean by sentiment?
What is sentiment relative to just having a different understanding of it? For what it's worth, to be very clear, when we say understanding, what we mean by it is, you just use the term in some different way. So it is not that we literally do not understand each other when we use the term abortion, like the entity, this thing.
We could agree, we understand what it is, we understand what marriage is, we understand what immigration is. It's just that our use of that term differs and that's what we are trying to pick up on. And then we're inferring that that difference in use is a difference in meaning and we're doing it via this distribution hypothesis.
But yes, just to reiterate, it's not far off the way that you are thinking about the sentiment problem and that's sort of how we set it up. But it's very murky, I think, in this stuff about what is sentiment? What is just, yeah, what is, we have a genuinely different understanding of the term.
>> Steve: Yeah, that's right. That thought also occurred to me as I was listening to you talk. But the immigration one in particular, in political speech, you guys are the experts, not me. But my sense is, in political speech, the word immigration is often used to invoke either negative or positive sentiments in the audience.
So there it really is about, it does seem to me about sentiment more than meaning. But I agree in some of these other settings, just the meaning itself may be what changes rather than the emotional response that's invoked.
>> Arthur Spirling: Yeah, I mean, just to give an example of what I think is a real meaning change rather than a sentiment change, right?
So, some of these terms, these other projects we've been doing, I've been doing on say, how a term like rights evolves, that's probably not a sentiment issue. That's probably literally a question about what we include in this class of object we call right. And obviously for the great majority of history, it doesn't include, say, the right to vote.
And then suddenly we think, yes, rights covers this other thing now. And it's not necessarily a sentiment, it's genuinely including something new. But yes, it's a tricky issue, I'll just say that. And we don't resolve it for sure in this paper, yeah.
>> Justin Grimmer: Okay, can we give Aaron the chance to ask his question, please?
>> Aaron: Hi, guys, really nice to see you. Congrats on a really interesting paper. So what I kept on thinking about was if maybe this is an extension or maybe it's something that you can already implement in your r package, which I haven't tried yet. But what about looking at these terms in a network context?
So, for example, go to the EU, I'm not asking to go there this time, but thinking about the EU plots here. So looking at education, where Conservative backbenchers are very negative about education, then when they're in office, then they're suddenly more positive. Another explanation for that could be not that they're just saying they're doing a great job, but really that they're looking at government action words, right?
So now that they're responsible for policy, they're not really talking about education the same way they're talking about the policies that they're implementing rather than criticizing what the opposition did. So to me that sort of, I would wonder if you could specify as a user a third term, like government action or a third topic concept sort of thing.
Or more inductively see how these associated terms change, and sort of maybe plot that in some sort of interesting way. So I'm curious if you've thought about that and what that might look like. It makes me think of sort of A dynamic topic model, but for individual words.
I'm just curious about how you might think about applying this in sort of a network context in which you're thinking about more like a broader set of words of interest. Thank you, this is really great work.
>> Arthur Spirling: Thank you, Aaron. I'll let Brandon chime in a minute. Thanks, Aaron, very nice to hear from you.
So quickly, can you look at the words around this particular embedding? Yes, and actually, as with reference to Justin's question, the way to do that probably is to look at the nearest neighbours. So probably right when they're in the early days when they're criticizing the Labour government for education, they're moaning about their perceptions of various scandals.
Or some issue with national exam achievement or something like this. And then they flip, and the nearest neighbors become things like, we're going to think about education as endowing our citizens with these new ideas, and so on and so forth. And so the nearest neighbors are changing because they're just talking about it in this fundamentally different way, and then by extension they have some other sentiment towards that.
I think that's probably right. This actually cuts into a big issue that we have discussed as co-authors. Which is that, when you're making a claim that your understanding of a term is changing over any historical period, you need to believe that something is pinned down, right? So if we want to say that a word has changed in meaning, like these MPs are talking about education in a different way, you have to believe that other things are somewhat constant.
I don't know if Brandon wants to get in on that one because he's given that some thought in the past.
>> Brandon: Yeah, no, no, I completely agree. I mean, by the way, hi, Aaron. Good to see you. Yeah, I think that that idea, particularly of visualizing this evolving set of neighbors, is a really, really good one.
And I'm trying to imagine it. It's much easier to imagine as an animation than as a thing on a page. But I think the connection to the way they do the visualizations in the dynamic topic model work by Blei and Lafferty is actually a really good invocation.
I kind of wish we had done that in part because it would really give you a sense of how those neighbors are changing. And that is the real power here, along with being able to drill down to particular instances and read exactly what is happening in these particular uses of the words.
Yeah, I agree with Arthur's point that this does raise, and is a good reminder of, the fact that you have to believe that something is pinned down. And one of the things, I'll just comment on that, that is good and bad about our framework is that by pinning yourself to the original set of embeddings, you are pinning yourself to whatever the meaning is of the words in that particular context.
And so that's sort of what is held fixed, and that gives you a way to interpret what's happening as opposed to everything moving together. But it also has the problem of you become in some sense chained to whatever the meaning is in the particular worldview you're looking at.
Now, in fairness, I think what's good about doing that with contemporary English is, I suspect, that's also what's happening with our readers, right? When we say that it means something like blank, presumably they're imagining in their minds the sort of contemporary meaning of that. But, yeah, I really do like this idea of more explicitly leaning into the fact that you have a little unweighted network here at any given time point.
>> Justin Grimmer: So, I have a question about perhaps studying something like discussion dynamics. So for example, you're studying something like an online discussion, and there's Godwin's law. Eventually, if people get in an argument, they're gonna mention that someone's a Nazi, and then that sort of derails the discussion. And you can imagine studying that as like a pre-post using exactly what you've done already.
Would it ever be useful to try to characterize the class of things that are Godwin's law by putting a language embedding on the right-hand side? There's one explicit statement, someone is a right-wing fascist, and that could derail a conversation. There's a lot of ways to phrase that, perhaps recently found on Stanford's campus, and you might want to incorporate that broader thing and not have the analysts put their thumb on the scale in that characterization.
You can imagine lots of other things like this where a certain topic could come up, sentiment could be expressed in a conversation that could lead to a different dynamic. Is there something in the embedding that can help with that? Or do we still want the analysts to do the characterization on the right hand side?
>> Arthur Spirling: It's a very interesting idea, yes. So firstly, yeah, I want to say that it's very obvious to me that the social center of gravity of the United States is obviously moving west. When I first arrived in the US, I would open up the New York Times and have to read about the internal campus politics of Yale.
And I'm so glad that it's moved on. Opening up the New York Times and reading about the internal campus politics of Stanford now. Thank goodness. So, just to clarify Justin, you're interested in something about the idea that the way that this term is being used is moving towards other terms.
Is that the idea, just even within a conversation, something like that?
>> Justin Grimmer: Yeah, so you can imagine, almost like your election example, where an event occurs and then there's, like, a pre-post change. You can imagine in a conversation, somebody brings something up and then the whole dynamics of the conversation changes.
Obviously I can encode that manually, but I wonder if we'd ever want to use the actual language in the conversation to discover a disjuncture or something like that.
>> Arthur Spirling: So I'm just thinking about sort of the easiest case would be something like, yeah, there's a disjuncture where the way that they talk about this one term, and by extension its nearest neighbors has moved.
That's a very compelling case. I guess the trick would be, with a conversation, right, whether you get enough data to make it a reasonable inference problem. I don't know, Brandon, if you have thoughts about this.
>> Brandon: Yeah, I think this is very cool. And I think one of the ideas in the Khodak et al. piece that we're building on, which I did with Sanjeev Arora's group here, was a very simple document embedding, where you just sort of embed all the unigrams with a la carte.
You embed all the bigrams. You embed all the trigrams, and then you just average the unigrams, you average the bigrams, you average the trigrams, and you concatenate them together. So if you have like a 300-dimensional embedding, you end up with like a 900-dimensional document embedding. I think just doing that for sentences and then checking sort of cosine similarity to these words of interest, like fascism, etc., would do that.
Actually very easily. And it would be super interesting to see if there's like a structural break there. But that's much less word-specific and more document-representation focused. You could also imagine, I mean, the good thing about like a straight-up language model approach would also be that you could think about,
like, what's the probability that some set of words about fascism is getting invoked, given the language that has appeared thus far. Which is another kind of interesting way to pull the probabilities out of these language models and turn them into something of interest.
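A toy sketch of the document embedding Brandon describes, paraphrasing the Khodak et al. idea: ALC-embed every unigram, bigram, and trigram, average within each n-gram order, and concatenate the three averages. Here embed_ngram() is a hypothetical helper that returns an ALC embedding for one n-gram.

```r
doc_embed <- function(tokens, embed_ngram, D = 300) {
  ngrams <- function(k) {
    if (length(tokens) < k) return(character(0))
    sapply(seq_len(length(tokens) - k + 1),
           function(i) paste(tokens[i:(i + k - 1)], collapse = " "))
  }
  avg <- function(grams) {
    if (length(grams) == 0) return(rep(0, D))
    colMeans(do.call(rbind, lapply(grams, embed_ngram)))
  }
  # Concatenate the averaged unigram, bigram, and trigram embeddings: 3 * D dims.
  c(avg(ngrams(1)), avg(ngrams(2)), avg(ngrams(3)))
}
# One could then compute cosine similarity between each sentence's doc_embed()
# and the embedding of a word like "fascism" and look for a structural break.
```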
>> Justin Grimmer: Okay, so we have a couple questions on the Q and A that have popped up.
Could we give Elizabeth Elder the chance to ask her question, please?
>> Elizabeth: Hi. Thank you. So I'm also thinking about this case of pre, post some events changes in the way people are using language. And I'm just thinking about the, you mentioned that you're looking at this data from the 1700s.
Using data from today to make the larger set of embeddings is not going to work. But I'm thinking the meaning of words changes also on kind of a shorter time scale. And so to the extent that you're looking at changes in language pre and post some event, while meanings might also just be changing secularly over the course of that time, is there a way in which kind of using the same set of embeddings to train, to understand the meaning of the words pre and post the event, might kind of shrink or suppress differences pre and post the event that you're caring about?
It sounds like maybe it's just not that sensitive to the kinds of changes that happen on a 10 or a 20 year time scale. But just curious to hear a little bit more about whether we might be kind of suppressing treatment effects by using the same corpus to train that.
>> Arthur Spirling: It certainly seems plausible to me. I mean, to pin any of this down, something has to be fixed, right. And so you think about what are we doing under the hood? There's some interruption, right. And before this, in the pre-period, we're taking some average.
Right. And then in the post period we're taking some average. And so you need to believe that the thing that we're taking the average over sort of makes sense in both periods. So in that sense, absolutely, you need to believe that something is pinned down. And I think it probably is fine for most of our applications in things like politics, but is maybe very undesirable in some rapidly moving areas where people are talking about social groups or social change or something like that, where there are multiple terms within that social group discussion which are themselves changing in terms of their meaning.
And that would cause a lot of problems for us, because then you're taking the averages over the wrong thing, basically. But yes, it could be. So it certainly could be suppressing the effects. I don't know if Brandon has done more thinking about that problem.
>> Brandon: Yeah, no, I think this is a really great question.
I mean, I guess what I would say is that even when language is changing extremely quickly, so much of the language is actually staying approximately fixed that I think it more or less works out. So, like, we had an example in the paper where we're looking at the difference between Trump pre election and Trump post election, where the sort of sense in which the name is being, I mean, same person, right.
The person was a candidate before, right? It's just like we're just sort of changing the distinction between how he's being talked about before the election and after, and even that's sort of getting picked up. We have no basis in the GloVe data, right? Trump is not even really on the political landscape at the point when that is being trained.
Right. Cause it's all like 2011, 2012 and I mean, he's doing political things, but he's not a political candidate. And yet we're still able to pick all of that up, essentially, because everything around it is still all these words about presidents and presidential candidates and things like that.
And so I think even in settings that are really rapidly changing, like language about the Internet or things like that, it's just so much of the language is staying the same. We just don't think about it very much because, like, we know what it means, and so we just move on.
>> Arthur Spirling: I was gonna say just maybe one case, which could be very complicated, actually. Maybe not in terms of over time, but moving across disciplines. So if you were trying to compare, you have two terms in statistics that mean this, and that's not what they mean in economics or something like this.
I can imagine that could be very complicated in practice. You're trying to translate between two different disciplines. That could be. Yeah, I could imagine what it means for something to be identified in statistics versus economics. And then you have all its other nearest neighbors built around that, but that itself is different between the two disciplines.
So it could, yeah, that could be tricky, actually. Yeah, yeah.
>> Justin Grimmer: Okay, so with that, I think we've reached the hour. If everyone can stick around, we're going to stop the recording, and then we can continue to ask Brandon and Arthur some questions about this fascinating paper. Arthur, thank you so much for presenting it.
>> Arthur Spirling: Thank you, Justin. I appreciate this as well.
Arthur Spirling is professor of politics and data science at New York University. He received bachelor's and master's degrees from the London School of Economics, and a master's degree and PhD from the University of Rochester. Spirling's research centers on quantitative methods for social science, especially those that use text as data and, more recently, deep learning and embedding representations. His work on these subjects has appeared in outlets such as the American Political Science Review, the American Journal of Political Science, the Journal of the American Statistical Association, and conference proceedings in computer science. Substantively, he is interested in the political development of institutions, especially for the United Kingdom.
Brandon Stewart is associate professor of sociology at Princeton University, where he is also affiliated with the Politics Department, the Office of Population Research, the Princeton Institute for Computational Science and Engineering, the Center for Information Technology Policy, the Center for Statistics and Machine Learning, and the Center for the Digital Humanities. He develops new quantitative statistical methods for applications across the field of computational social science. Along with Justin Grimmer and Molly Roberts, he is the author of the 2022 book Text as Data: A New Framework for Machine Learning and the Social Sciences.
Steven J. Davis is senior fellow at the Hoover Institution and professor of economics at the University of Chicago Booth School of Business. He studies business dynamics, labor markets, and public policy. He advises the U.S. Congressional Budget Office and the Federal Reserve Bank of Atlanta, co-organizes the Asian Monetary Policy Forum and is co-creator of the Economic Policy Uncertainty Indices, the Survey of Business Uncertainty, and the Survey of Working Arrangements and Attitudes.
Justin Grimmer is a senior fellow at the Hoover Institution and a professor in the Department of Political Science at Stanford University. His current research focuses on American political institutions, elections, and developing new machine-learning methods for the study of politics.