Daniel J. Hopkins speaking on The Rise of and Demand for Identity-Oriented Media Coverage.
The Hoover Institution hosts a seminar series on Using Text as Data in Policy Analysis, co-organized by Steven J. Davis and Justin Grimmer. These seminars will feature applications of natural language processing, structured human readings, and machine learning methods to text as data to examine policy issues in economics, history, national security, political science, and other fields.
Our 16th meeting features a conversation with Daniel J. Hopkins on The Rise of and Demand for Identity-Oriented Media Coverage on Thursday, February 16, 2023 from 9:00AM – 10:30AM PT.
>> Justin Grimmer: Hello, everyone. Welcome to the Hoover Institution workshop on Using Text as Data in Policy Analysis. In this workshop, we feature applications of natural language processing, structured human readings, and machine learning methods to text as data to examine policy issues in economics, history, national security, political science, and other fields.
I'm Justin Grimmer. Steve Davis and I co-organize the workshop. Today, we're thrilled to have Dan Hopkins, professor of political science at the University of Pennsylvania, here to present his paper, The Rise of and Demand for Identity-Oriented Media Coverage. Just some quick ground rules. Dan's gonna speak for around 30 to 40 minutes.
If you have questions, please put them in the Q&A feature. Steve and I might interject with some pressing questions, and once you put the question in the Q&A, we may recognize you to ask the question live. After about an hour, we're going to turn the recording off, and we'll go to a more informal Q&A session where we can ask more nuts-and-bolts style questions about how Dan and co-authors produced all the very interesting estimates that they're going to present to us today.
So with all that said, Dan, take it away.
>> Daniel Hopkins: Amazing, I am really grateful to be here. Thanks to Steve and to Justin. Thanks to Justin for the introduction and thanks to the audience. So let me share my screen. And so this is joint work with Yph Lelkes and Sam Wolken, who is a joint PhD student here at the University of Pennsylvania with the Annenberg School for Communication.
And this paper starts with a motivation that you see initiating a lot of papers and a lot of research in recent years. There is a model in political science that is arguably the dominant or certainly one of the main models of how political scientists think about the interaction between politicians and political elites more generally and voters.
And that model holds that politicians can build coalitions, can build support by altering the salience of a critical issue. This is a model that has been seen in various forms, in various research in American and comparative politics for generations. The presumption built into this model is typically that issue salience, which is the critical parameter, when we're thinking about different political equilibria, that issue salience is set by politicians.
That politicians, through their rhetoric, can raise the salience of certain issues. So, for example, Donald Trump, in his 2015, 2016 presidential campaign, raised the salience, this model might hold, of the immigration issue, and so took a set of voters who otherwise might have voted for the Democrats. And by elevating the salience of an issue that was important to them, brought them to vote for him.
And one of the core presumptions of this paper is that while those models are incredibly helpful in thinking about the interaction between politicians and the voting public, that issue salience may be a product of the media environment as well as politicians. And in fact, it may be a product not only of the media environment, but an interplay between the media environment and consumer behavior.
So in some ways, I see this paper as trying to bring together pretty disparate literatures in the political economy of the news media and political psychology. Our goal more narrowly is to investigate how changes in the media environment relate to the salience of social identity cues in media content.
Now, social scientists are not the only ones. We're indeed maybe not even the primary ones who are doing research as to what generates attention online. Companies themselves, whether it's media outlets or social media firms, are very interested in what gets people to pay attention to news, what gets people to pay attention to other content online.
And so today's media environment has been widely characterized as a highly competitive one. And our starting point is the fact that a highly competitive media environment, coupled with a shift towards the digital dissemination of news, creates a whole set of new opportunities for customer research. So I grew up in the 1980s, in the early 1990s, and at the time, I mostly read my local paper to figure out how the Mets were doing.
Usually, you know, actually back then they were doing pretty well. But the critical point was that my local newspaper didn't have that much capacity to know what I was reading. Was I reading the comics? Was I reading the front page? They just delivered me a paper, and they could do user surveys, they could read letters to the editor.
But their opportunities for direct audience feedback were relatively limited. Nowadays, however, if I were to log on to, if that local paper still existed, I could log onto its website. But when I log on to, say, the New York Times or Fox News, they can, in a very, very granular way, track precisely how long I stay on an article, what kinds of headlines I am likely to click on, and they can then iterate.
We are in a system in which news outlets have a tremendous amount of information about what does or does not get people to click, what does or does not get people to engage. And one of the things that we want to ask is then, how is this changing environment going to influence the kinds of news that are likely to be provided?
So what is it that news outlets are learning about the engines of online engagement? We're gonna focus in this research on the role of core social identities, by which I mean social identities such as gender, race, ethnicity, political partisanship, okay? And so our question is, to what extent are media outlets learning over time that coverage which emphasizes social identities is more likely to generate engagement, and then potentially altering their coverage as a consequence?
And it's really important to emphasize that there are multiple ways in which media outlets may respond to this kind of knowledge. That is, social identity content can shape, it can shape the stories that news outlets cover, it can shape the content that they're covering, or it can shape the way in which stories are framed.
You could imagine a story about a soccer team which either does or does not play up certain social identity dimensions. Or you can imagine the choice to cover an identity-inflected issue over another kind of issue. And I'm gonna be attentive to moving back and forth between those different research questions and that they imply different counterfactuals, which we need to be clear about.
So we start then from the more general question, is identity-inflected content more likely to go viral? Ezra Klein, in a 2016 interview with Tyler Cowen, said, what is mattering is you're using outrage people have, because of a shared identity, to send something viral through a sharing mechanism.
And you can see here, for instance, both on the left with Fox News and on the right with MSNBC. Observationally, it seems that in recent years, outlets on the left and on the right have emphasized social identity-laden stories. So, for instance, on the left, you see the story of an unauthorized immigrant who was alleged to have killed, and I think later found guilty of killing, an Iowa woman. On the right, you can see that the GOP says that Kamala Harris doesn't care about white people after Hurricane Ian. So both of these are stories that potentially could generate audience engagement on the basis of their social identity content.
Okay, and Ezra Klein is both a journalist, but also is the co-founder of Vox, is a participant in the system. Who would certainly have access to the kind of information about the stories that an outlet like Vox is or is not promoting and what they're learning as an online media provider.
So I have a descriptive question here, which is, has news media coverage of core social identities increased in recent years? The answer to that is gonna be yes. And then from there, a second descriptive question, which is, what happens when news media coverage highlighting social identities is posted on social media?
And that's obviously not all of news media coverage, but it's some. Does it generate additional engagement? And we're gonna find there that the answer is often it does, but it's gonna vary a little bit depending on the specific outcome that we're looking at. And then finally, we're gonna get to a causal question.
Do headlines with explicit social identity cues generate more clicks than headlines for the same story without such cues? This is a more precisely defined causal question. And here, you'll notice we are focused on the framing of the same story. So the Upworthy website decided to cover a given story and then varied the headline with which it promoted that story.
And we're gonna use that variation to see if explicit social identity cues and headlines generate more engagement, and if so, under what conditions. And then we're gonna find that often, they do. Okay, so I'm gonna start by briefly laying out hypotheses. Again, the broad intellectual goal is to combine the political economy of news media research with a pretty separate research on the political psychology of news consumption.
I'll present some observational evidence from 6.2 million tweets. And I'm gonna briefly make reference to some of the work that we've done using Facebook data via the Social Science One collaborative, and I'd be happy to go into this in the question and answer. I'll then present the experimental evidence that we have from Upworthy.
And one of the core take-homes, looking at that experimental evidence, is that a headline with an identity cue is gonna garner just under 0.8 additional clicks per 1,000. Now, that doesn't seem like a lot. But it's worth noting that stories that have, say, the word Obama in them generate roughly the same order of magnitude of additional clicks relative to stories that don't.
Okay, and then I'll conclude, talk about next steps. I really look forward to people's feedback. This is an ongoing project, as you will see, it's a pretty high level project, meaning that it could give rise to multiple papers. And I'll be very curious to see the directions in which people want to push the research.
Okay, so with respect to theoretical background, there's longstanding evidence that media outlets respond to market pressures. That media outlets are, at least in part, market actors, like many other kinds of firms. But what's new, and I alluded to this earlier, is the timing and the granularity of consumer feedback.
The fact that a nightly news program in the 1990s had only limited ways of knowing how many people were watching at a given time. And in particular, what elements of the newscast people were tuning into. When did people look up from what else they were doing? Whereas now, we have very, very finely grained data for those people who consume news in digital formats about exactly how they engage with that news.
And that's gonna give new opportunities for customer research and potentially for then dynamic processes. Wherein media outlets change the provision of media content to try to maximize engagement, readership, or the like. I do think that there are really interesting questions about what is it precisely that media outlets wanna maximize, I mean, ad revenue, in short.
But how they do so is an open question. What function of different forms of engagement generates ad revenue is yet another question. So websites are very helpful in that they produce real-time, story-level feedback via clicks and social media engagement, which can provide a lot of information to content producers and editors about the kinds of information that are generating engagement.
And you can see here on the bottom of the slide. The New York Times, in fact, has had a trending column, where not only are they aware internally of what stories are trending. But they use that and social proof style mechanisms in order to further increase people's engagement in trending news stories.
So then the question becomes, well, what feedback, for instance, does social media or the digital provision of news more generally provide to media outlets? And one observation that we start with is Ezra Klein's observation that identity-related content, and maybe particularly certain kinds of identity-related content, such as threats to people's core social identities, is one potent mechanism through which to generate online engagement.
Now, why is it that social identities would be a particularly valuable channel through which to generate engagement? Well, I just wanna define social identities as being the part of our individual self-concept that's informed by group membership, right?
The part of my self-concept that is informed by living in Philadelphia or being a Mets fan, or the combination thereof, which is dangerous. But again, that's an aside. Now, why might social identity cues be a vehicle for engagement with news media? There are a whole set of reasons, right?
Social identities tell us what information we should pay attention to, how group members should act (they have a normative element to them), and whom to trust. So I can use a partisan social identity to figure out, of the gauntlet of news that I got about, say, Nicola Sturgeon's resignation yesterday,
Which sources should I trust? How should I make sense of that information? And in a very, very information-rich environment, social identities provide a useful way to filter out the mass of information and to figure out what information is relevant to me. What information tells me how, given a certain social identity, say, as a political scientist or as an academic, how should I behave in certain settings?
There's a lot of information and a lot of heuristic value in social identities, okay? And it may also be that the performativity of social media has the potential to amplify identity cues. There's extensive research suggesting that when we engage in social media, to the extent that we do, we are in part performing certain kinds of identities.
And of course, social media are, to greater or lesser extents, a public, or at least quasi-public, environment in which I am not only consuming information for myself, but I am using that information to signal things about myself, right? So it's worth being attentive to dynamics both with social media outcomes that can be observed by other people, such as when I share a story, and those that can't be observed by other people, such as when I just click on a story. Okay, so the hypothesis we're gonna start from is the idea that news that highlights core social identities is more likely to generate engagement. And we're gonna be focusing on news media content provided through social media.
And then our broader research question is, well, if this hypothesis holds, how, if at all, do media outlets learn and adapt to it? And it's of course possible that the hypothesis could be wrong. But if media outlets in general think it to be the case, if Ezra Klein generalizes, they could still act on it, even if, as social scientists, we don't think that there's a causal effect there, okay. So this is like a 35,000-foot analysis, in the sense that we are gonna take a very, very high-level approach. You could write many, many papers on a specific category of identities, or indeed on specific identities, the way in which African American identity is invoked in news media.
And the differential responses among African American, white, Latino, Asian American, and American Indian respondents. So this is a very, very high-level approach, which I hope will give rise to, or help nudge along, more identity-specific studies. And also, our emphasis here is on the identity cues in the media. But these are, of course, gonna interact with consumers' identities. And so if you are Fox News and have a certain image of your archetypal customer, you are likely to cover identity issues differently than if you are MSNBC. Okay, so one of our first data sources is gonna be 6.2 million tweets.
And I should add here that we are tremendously indebted to the fact that Twitter used to provide such data for free via an API. And I'm distraught that moving forward, such data are not gonna be available. We downloaded 6.2 million tweets from 19 prominent media outlets between 2008 and 2021.
We identified these 19 prominent media outlets to be a wide range of online outlets, such as Politico, print outlets such as the New York Times. Outlets that are national, the New York Times, Wall Street Journal, and those that are more regional, like the Philadelphia Inquirer right here in Philly.
And those of varying political slants, from something like the Huffington Post to something like Fox News or the One America News Network. And what we're gonna do is we're gonna both hand-code fairly large samples, so that we can make statements just about the hand-coded data, and then we're gonna try to use BERT, which is a transformer-based classifier, to extend our dataset.
But we're also attentive to the fact that there are gonna be certain limitations in that. And so we want to make sure that our core inferences hold just with the hand-annotated dataset alone. The way that we generated the hand-annotated dataset is that we did weighted sampling within year. Even though we have more tweets in certain more recent years, because the volume of tweets has grown, we wanted to make sure that we had reasonable samples from each of the years covered in our dataset. So we then annotated, and we initially worked with ourselves and undergraduate research assistants to try to identify categories.
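The per-year sampling just described can be sketched in a few lines. This is only an illustration, not the authors' code, and the field names (`year`, `id`) are hypothetical:

```python
import random
from collections import defaultdict

def stratified_sample(tweets, per_year, seed=0):
    """Draw up to `per_year` tweets from each year, so that early,
    low-volume years are not swamped by recent, high-volume ones."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for t in tweets:
        by_year[t["year"]].append(t)
    sample = []
    for year in sorted(by_year):
        pool = by_year[year]
        sample.extend(rng.sample(pool, min(per_year, len(pool))))
    return sample

# Toy corpus: 2009 has far fewer tweets than 2021,
# yet both years contribute equally to the annotation set.
tweets = ([{"year": 2009, "id": i} for i in range(40)]
          + [{"year": 2021, "id": i} for i in range(400)])
annotation_set = stratified_sample(tweets, per_year=30)
```

A simple per-year cap like this is one way to implement "weighted sampling within year"; the actual scheme may weight years differently.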
As often is the case, our initial ambitions as to what could be annotated and our final coding scheme were not identical. There are a wide range of interesting categories, such as, say, identity threat, that we could not reliably annotate. And so what we wound up doing is really focusing just on what a given tweet explicitly says.
This one here is a Wall Street Journal tweet about a Georgia farm owner who sees GOP-championed policies as better for her. Bottom line, I'm still gonna say Democratic, mostly because I'm black. So there's a clear invocation of both a partisan identity, but also a racial identity. So one of our categories was race/ethnicity, where we would denote something as a one if a tweet makes explicit reference to a group identity based on race or ethnicity. We had similar categories, for instance, for partisanship. You'll notice that these are conservative definitions, so we would only call something political if the tweet explicitly made reference to a partisan group. Now, of course, Donald Trump and even Mitch McConnell are partisan actors, and many of the people who are exposed to these tweets are gonna know that partisanship.
But we used a conservative definition of really focusing on cases where the identity is explicit in text. That said, we also did code an implicit racial or ethnic category where we drew upon prior research to identify certain political issues, immigration, welfare, criminal justice. Which have been commonly and consistently associated with racialized groups, and we coded those in a separate category.
Okay, then our goal in using BERT was to further extend the dataset to enable more granular analyses. We were able to hand-code roughly 9,000 tweets, but we wanted to be able to then extend our analyses further. You'll see that some of these categories just come up pretty infrequently.
We wanted to see what are the capacities of a transformer-based classifier to help us extend our dataset. We chose BERT after a set of competitive trials. As somebody who had initially done this kind of work a decade ago with classifiers like support vector machines, I was struck by BERT's significantly better performance. It's a highly flexible learner, and I'm excited to talk more in the Q&A about folks' reactions to it. So one of the first things that we did is we wanted to see how well BERT could recover these categories. And we have both the Twitter data I've already spoken about,
And also we have about 550,000 blurbs from these same media outlets that were posted on Facebook in a shorter time period, from about 2017 to 2021. And I'm gonna be talking about the analysis of those Facebook blurbs as well, even though this particular talk really emphasizes the Twitter data.
So what we did then is we tried out a platform-specific classifier, where we would just use data from that platform, where Facebook is a platform and Twitter is a platform. And we also then separately tried pooling the data, saying, hey, maybe there's going to be information in the Facebook blurbs that transfers over clearly to Twitter.
What you can see here is how both the platform-specific BERT model and the model trained on pooled data perform, and then we just selected in each case whichever had the better F1 score. It's worth noting, though, that in some cases, such as religion, there are relatively few tweets in this period of time, relatively few stories in our sample, that focus on religious identities. And so ultimately, we were not satisfied with the platform-specific data there. And you'll notice, too, that for religion on Twitter, neither the F1 score when we use the platform-specific approach, nor the F1 score when we use the pooled data, really does well enough for our standards. So on religion, I'm only going to report hand-coded data, because we're not satisfied that our classifiers can tell us enough about the over-time trends in the religion-on-Twitter category. But in general, we note pretty high performance. It's worth noting that that performance varies a little bit depending on the platform and on the classifier. But overall, we certainly thought these levels of performance were adequate, so we would choose in each case whichever is higher. So, for instance, on race and ethnicity on Twitter, the platform-specific training set produces a somewhat better classifier, so we went with that. Okay, so what do we find?
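Fine-tuning BERT itself needs the full transformers stack, but the model-selection step described here, scoring each candidate classifier's F1 on held-out annotations and keeping the better one, is simple. A minimal sketch with made-up predictions (not the authors' code):

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall on the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pick_classifier(y_true, preds_by_model):
    """Keep whichever model scores the higher held-out F1."""
    scores = {name: f1_score(y_true, preds)
              for name, preds in preds_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy held-out labels and two candidate models' predictions.
labels = [1, 1, 1, 0, 0, 0]
name, score = pick_classifier(labels, {
    "platform_specific": [1, 1, 0, 0, 0, 0],  # precise but misses one
    "pooled":            [1, 0, 0, 1, 1, 0],  # noisier
})
```

In practice one would compute this per category and per platform, exactly the grid of F1 scores the slide reports.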
We do find a rise in identity content across media outlets. And again, this is not a causal claim. I am simply documenting, descriptively, something that I think many people who've watched American politics closely would have guessed, which is that there has been a rise in identity content across media outlets and across platforms that's been most pronounced in recent years.
Now, of course, could this be driven by events, is this #MeToo, is this George Floyd's killing? Yes, undeniably, in part. In this descriptive analysis, what we want to do is just lay out the broad trends over time, and then we can try to probe some of the mechanisms underneath.
But to the point, is this just a political economy story in which media incentives are changing? No, it's not only that; it certainly could be an interaction of changing media incentives with the particular issues that are salient given the events that happen. We're also going to find that identity content is often, not always, but often associated with increased engagement.
And I'll get into both those observational associations and some of the upworthy evidence that may be more credibly causal. Just as a brief reminder, the several years that we're studying have been a period of time in which there's been a lot of identity related news. So on the bottom, for instance, you see a number of stories related to racial identity in particular, and on the top, you see a number of stories related to gender.
It's worth keeping in mind, for instance, that #MeToo, though the hashtag had existed for years, really rose to prominence in late 2017. And of course, George Floyd's killing was another kind of watershed moment, as was the election of Barack Obama. There are a lot of events on which media outlets, if they are interested in covering identity-oriented stories, can do so.
So here, what we show is the trends for each separate outlet over time, and these are mean-centered. So if you notice, as is very, very clear in the race and ethnicity category, but is certainly statistically distinguishable for gender and for political identities as well, the lines tend to be above zero later in the period, suggesting overall that there has been a rise. In gender, it seems to maybe happen a little bit earlier. The pattern is perhaps clearest with respect to race and ethnicity. But I think another thing that's worth noting, and I'm again happy to talk about this in the Q&A, is that it's not specific to any one single outlet or even a type of outlet.
When we approached this, we thought that Fox News and the New York Times might well approach identity oriented stories very, very differently. I had certainly noticed from, say, my phone just scrolling at the news app that many of the stories that my phone was trying to get me to read on Fox News had identity content of some form or another.
And it's noteworthy that, in fact, if you just compare the New York Times and Fox News, the New York Times has more explicitly identity-oriented stories in this period of time than Fox News does, okay? You can also look at specific outlets, and you can see this across our range of outlets, by and large. The pink or the salmon color indicates the BERT annotations, whereas the blue indicates those that were human-coded. The core take-home point is that, relative to the zero line in both cases, most of the dots are to the right, meaning that if I give you two tweets, one with social identity content and one without, for most outlets, whether you're using BERT or just the annotated subset, we find observationally that the tweet with social identity content is going to get more engagement than the tweet without. Now, of course, they differ on a whole bunch of different dimensions, right? Social identity content is not randomly distributed across the issues that newspapers are writing about, or news outlets are covering.
But we find this instructive nonetheless, right? And though you may, and you should look at this as social scientists and say this is not causal, it is nonetheless the kind of information that would be readily available to a digital content editor. And so it may be actionable irrespective of whether it's causal.
We can also generate statistical models, so here, for instance, the outcome is logged favorites. So for people who are on Twitter, that's when you see a tweet and you decide to affix a little heart to it. And one of the core things that we can see here is that, again, observationally, when we include year fixed effects and outlet fixed effects, we find that those tweets that have social identity content are more likely to generate those favorites. And the sample size here tells you that this is just the human-coded tweets. They're also more likely to generate retweets, right? And you can notice that this is particularly true for, say, the tweets that have religious identity content, but is broadly true across a range of different kinds of identity.
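A specification of this sort, log favorites on an identity-cue dummy with fixed effects, can be sketched with the within-estimator: demean both variables inside each outlet-by-year cell, then regress. This is an illustration of the technique, not the paper's actual model or data; field names are hypothetical:

```python
from collections import defaultdict
from math import log

def fe_slope(rows):
    """Within-estimator: slope of log(1 + favorites) on an identity-cue
    dummy, absorbing outlet-by-year fixed effects via demeaning
    (the Frisch-Waugh-Lovell trick)."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r["outlet"], r["year"])].append(r)
    num = den = 0.0
    for cell in cells.values():
        ybar = sum(log(1 + r["favorites"]) for r in cell) / len(cell)
        xbar = sum(r["identity"] for r in cell) / len(cell)
        for r in cell:
            dx = r["identity"] - xbar
            dy = log(1 + r["favorites"]) - ybar
            num += dx * dy
            den += dx * dx
    return num / den

# Toy data: within each outlet-year, identity tweets get more favorites.
rows = [
    {"outlet": "A", "year": 2019, "identity": 1, "favorites": 20},
    {"outlet": "A", "year": 2019, "identity": 0, "favorites": 10},
    {"outlet": "B", "year": 2020, "identity": 1, "favorites": 40},
    {"outlet": "B", "year": 2020, "identity": 0, "favorites": 20},
]
effect = fe_slope(rows)  # roughly a 0.66 log-point gap in this toy data
```

Because log differences approximate percentage changes, a coefficient around 0.2 on the dummy corresponds to the "around 20% more favorites" figure mentioned later in the talk.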
Okay? An observational association, but one that may be important precisely because it may be actionable to news editors. Okay, well, that all said, Twitter users are a distinct minority. Even before the latest changes in the Twitter ecosystem, they are a minority of people. They skew highly educated, and they tend to skew Democratic, not exclusively so, but more so.
So we then wanted to balance this analysis by also using Social Science One to access shares on Facebook. And the Social Science One platform is an interesting collaboration between social scientists and Facebook, which many of you folks may know a lot about. But I'm happy to talk more about our experiences in the Q&A.
But one of the challenges here is that, unlike in many other data analyses, we as researchers cannot work directly with the data. So the data are protected from us, and they are provided in an aggregate form. That limits some of the analyses we can do, as well as some of the initial data checking that would be part of any social scientist's scouting of a project as you're diving into a dataset.
So here, as I mentioned, we have about 550,000 media URLs. So when you put a media URL into Facebook, there's a suggested blurb that pops up, and we're analyzing those blurbs to see how their content relates to sharing behaviors and posting behaviors on Facebook. And we're gonna be employing corrections for differential privacy.
One thing to note here is that differential privacy is going to replace actual totals with model-based estimates, which can sometimes be negative. So you can sometimes see that a newspaper article was shared with negative six people. That is important because, looking at these kinds of data, you would naturally think about logging them. But at least in its vanilla form, you can't do that here. So here, what you can see, again over a shorter timeframe, are some of the trends that are evident on Facebook, and to a first approximation, they reinforce the trends that we saw on Twitter, despite the different user base.
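On the logging problem just mentioned: one common workaround in applied work, though I'm not claiming it's what the authors use, is the inverse hyperbolic sine transform, which behaves like a log for large counts but is defined at zero and for negative values:

```python
import math

# Counts after differential-privacy noise; note the negative value,
# which would crash math.log but is fine for asinh.
noisy_counts = [-6, 0, 1000]

# asinh(x) = log(x + sqrt(x^2 + 1)), which is ~log(2x) for large x.
transformed = [math.asinh(c) for c in noisy_counts]
```

For `x = 1000`, `asinh(x)` is within a fraction of a percent of `log(2000)`, so regression coefficients retain their approximate log interpretation.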
So you can see that Donald Trump's election is associated with a spike in uses of political identities, that #MeToo in late 2017 is associated with a spike in mentions of gender-oriented identity, and that the killing of George Floyd and the subsequent protests were associated with a spike in the use of racial and ethnic identity.
So by and large, this validates the Twitter findings, and I'm happy to say more in the Q&A. We've done a whole set of analyses, because one way in which the Facebook data is better than the Twitter data is that we can do subgroup analyses based on, say, users' imputed political ideology. And the story is not always consistent across outcomes.
It seems that certain kinds of social identity content, when posted on Facebook, generates, say, like shares, but not actual clicks. And our running explanation for that is that there's some social identity content that you may see and know, hey, this is good for my side, or this is bad for my side, I want to share this, but you may be sufficiently familiar that you may not actually need to click on it.
And that raises an interesting question about what precisely media outlets' incentives are, and what social media firms' incentives are when it comes to this kind of engagement. Okay, so identity-oriented tweets increase favorites by around 20%. We find similar but somewhat stronger results when we look at retweets as the outcome, right?
But one critical question to ask is, well, what is the implied counterfactual here, right? What is included in the all else equal? And as I've been motivating as we've gone along, I mean, this is important, I think, even just observationally, because it's information that's available to the media outlets.
But nonetheless, you may ask, well, what is the inferential target? What's the counterfactual that we're talking about? And so that makes it useful to home in on one particular part of this broader story, which is the question: does the same story, when promoted with its social identity elements amplified, generate more engagement than the exact same story with a different form of promotion? Okay, here I'm going to turn to Upworthy. If this were an in-person talk, I would ask how many people were familiar with Upworthy. It's not a randomly selected news outlet, if there is such a thing.
That is to say, it has particular features. Well, I mean, I guess it would be randomly selected, but the law of large numbers doesn't kick in, right? It's not a representative news outlet. They focused on positive news stories, and they were particularly prominent in the period from, say, 2013, 2014, 2015, 2016.
But their goal was precisely to try to tell the news, to relay the news, in a more affirmational way. And so you can see an example of what happens when you pull their homepage up on the Wayback Machine. But what's really neat about Upworthy is that a set of social scientists worked with Upworthy to get more than 3,000, almost 4,000, headline experiments that they ran in the period from 2013 to 2015.
So their business process was that they would give the same story multiple headlines and then randomly assign them, depending on when you landed. And every once in a while, you can catch the Washington Post or the New York Times doing exactly this, right, randomly assigning people who land on their page to different versions of a headline.
And from that, these almost 4000 stories had about 18,000 unique headlines. And what this lets us do is it lets us look at the attributes of the headline that are associated with more people clicking on the story. And so what's helpful here is we can analyze cases with and without cues for each type of identity.
That is, we can take a given story and see, well, what happens when they did highlight an identity cue and when they did not. Now, of course, other factors may vary with the identity cues, but this nonetheless represents an opportunity to test the broad claim that social identity cues may be part of the process of generating increased engagement.
So to give you a sense of what we're doing here, you can look at this table and see, in headline package one, for instance, "Why are 193 countries joining a feminist tsunami on Valentine's Day?", right? There's an explicit reference to gender there, but a different headline for the same story, "This may be the best alternative ever to wallowing in Valentine's Day self-pity," does not make explicit reference to gender, right? Or take the third headline package, "These insults are disgusting. When you hear who's saying them, it gets worse." It doesn't tell you what kinds of identities we're talking about. But if instead the headline were, "Here are some insults too many black people have heard. Who said them might surprise you," there's obviously an explicit reference to a racial and ethnic categorization. So we're gonna leverage within-story variation to try to look at the impacts.
And so if there's a story where either all the headlines use social identity cues or none of the headlines use those social identity cues, that's not gonna contribute to our estimates, because we're looking only at variation within a given news story. So here, what you can see on the left is the subset of headlines that we annotated ourselves, whereas on the right are those that we used BERT to annotate.
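The within-story design just described can be sketched concretely. What follows is an illustrative reconstruction, not the authors' actual code: it builds synthetic Upworthy-style data (the story counts, cue rates, and effect size are all assumptions) and computes the fixed-effects, story-demeaned slope, which by construction uses only variation among headlines of the same story.

```python
import random

random.seed(7)

# Synthetic Upworthy-style data (illustrative, not the real experiments):
# each story has several headline variants, some with an identity cue
# (x = 1) and some without (x = 0). The outcome y is clicks per 1,000
# impressions, with a story-specific baseline.
TRUE_EFFECT = 0.8  # assumed extra clicks per 1,000 when a cue is present
stories = []
for _ in range(500):
    baseline = random.gauss(14.0, 3.0)  # story-level popularity
    variants = []
    for _ in range(random.choice([2, 3, 4, 5])):
        x = 1.0 if random.random() < 0.4 else 0.0
        y = baseline + TRUE_EFFECT * x + random.gauss(0.0, 1.0)
        variants.append((x, y))
    stories.append(variants)

# Within (fixed-effects) estimator: demean x and y inside each story,
# then slope = sum(x_dm * y_dm) / sum(x_dm ** 2). Stories whose headlines
# all share the same cue status demean to zero and drop out, which is
# exactly the "within-story variation" logic of the design.
num = den = 0.0
for variants in stories:
    n = len(variants)
    x_bar = sum(x for x, _ in variants) / n
    y_bar = sum(y for _, y in variants) / n
    for x, y in variants:
        num += (x - x_bar) * (y - y_bar)
        den += (x - x_bar) ** 2

fe_slope = num / den
print(f"within-story estimate: {fe_slope:+.2f} clicks per 1,000 impressions")
```

Because the story baselines are differenced away, the estimate recovers the cue effect even though popular stories may be more (or less) likely to get cue-laden headlines.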
And so what would I focus on? I would certainly emphasize that in either case, we see that headlines that mention racial and ethnic identity generate more clicks per thousand impressions than do headlines that do not. And you can see some suggestive evidence, say, for other identity categories.
We have a lot more statistical power when we employ BERT. So you can see with the BERT-classified headlines that those that, say, employ gender or sexual orientation identities generate about 0.65 more clicks; those that employ racial or ethnic identities, about 0.87 more clicks; religious identity, about 1.5 more clicks.
So across those three categories, there's reasonable evidence that randomly selected headlines with those kinds of identity cues generate more engagement. So, great. So I've reached the conclusions without being interrupted, partly as a product of the medium. So this is a broad research study, and there are many questions that you can ask, and I think there are a wide range of possible follow up studies.
But we did try to take causal evidence from Upworthy to suggest the credibility of one part of our much, much broader explanation for the changing political economy of the news media. It's worth noting that, as you saw, a headline with an identity cue garners about 0.8 additional clicks per 1,000, and a headline with the term Obama generates about 0.86.
If you think about other terms like crime or sex, they're all in about this ballpark, okay? One of the things that I've realized in doing a lot of this research is just that people who are on social media look at a lot of things without clicking, right? If you can produce an additional click per 1,000, that's actually, that's not nothing.
Descriptively, we saw that news content with social identity cues is more likely to generate specific forms of engagement. Of course, we're very attentive to the fact that our observational analyses should not be used to make causal inferences, but they may approximate the data that decision makers have. And so, we think, are an important part of this broader story.
And, of course, it's not simply that the political economy of the news media is changing, but that there's a set of events in the real world, that the election of Donald Trump, the messaging streams that emerge from politicians, all of these things are changing. There are normative trade offs in the rise of identity oriented news coverage, which again, very happy to talk about in the question and answer period.
On the one hand, attention to marginalized groups could be very helpful as a springboard to mobilization, as my colleague here, Dan Gillion, has written about, but may also provoke backlash or generate a kind of intractable us-versus-them politics. Now, identity cues as a channel for virality are not just relevant here in the United States.
They're relevant in other countries. They're relevant in international relations. For those people who are familiar with the United Kingdom's Trojan Horse affair, I think that could arguably be said to fit very well within this broad framework in which certain social identity categories were weaponized as part of an ongoing political dispute.
If you look at how Russia intervened in the 2016 election, I think it's really important, and I think Justin has been helpfully clear on the point, that there is not evidence that Russia's 2016 intervention was impactful on the order of shaping Donald Trump's victory. But I think it's also important that we do learn something about how Russia intervened, about what Russia thinks about American politics and the prospects for using American political divisions against us.
And if you look at the posts that the Internet Research Agency was seeding into different American social media conversations, they often drew on identity, whether it was southern identity, southern white identity, or black identity. So we are certainly not the only people to be attentive to this dynamic.
Insights about the changing technology, and the way that political actors can discover audience preferences, may also apply not just to media outlets but to other political actors. You could argue that Donald Trump's approach to politics is very much about putting out feelers and then seeing how they do on social media, seeing how they do at rallies.
And so with that, I really look forward to your questions and the conversation. So thanks, everybody.
>> Justin Grimmer.: Thank you, Dan. I'll start off with just a couple questions, and then I know Steve will have some questions, and there's at least one question in the chat. So I have a high-level question and then a sort of text question.
So the high level question is thinking about, suppose we stipulate to all the patterns in your data, and thinking about the argument that the paper is making, potentially that the feedback from the granular information that editors can receive from clicks is then incentivizing them to take on certain kinds of stories.
It seems hard to parse that from a broader set of phenomena that are occurring simultaneously. So, for example, it is also the case that particularly during the latter time period here, there's overt campaigns on social media to encourage certain kinds of stories or to discourage composition of certain kinds of panels or stories.
So, for example, people will overtly say, this is bad, because you have potentially a male writer writing about a woman's issue, or you have a white writer writing about this issue that's about black identity. And that sort of feedback that you obtain would not be contained in the clicks and would also potentially inform not just the editor at that news outlet, but potentially many editors.
One could imagine this, plus other events in the world, combining to lead to a greater recognition of certain kinds of issues, and it's hard to parse that out. So I'm curious about your thinking on the relative weight of this granular information, or if it's sort of amplifying. And then on the text side, the use of BERT is very interesting, perhaps a little surprising, because the headlines are so short, it's hard to get a lot of text to glom onto.
And so I'm curious what a sort of dictionary-style approach would look like if you built out, say, the 30 most used terms for each of the categories. Is that competitive with BERT? Is there something that's missing? What are the terms that BERT's grabbing onto that help it do better?
>> Daniel Hopkins: So these are both great questions, and I think that my answers are not super satisfying, in the sense that, undeniably, there are a whole range of things going on. And on the first point, you're right that there's a ton of what, from a statistical point of view, we would think of as interference, right?
Lots of people are observing what's happening. Social media is a public, it's an emergent, it's a single phenomenon, right? So, in part, I'm very curious to think through how we can parse out our particular story. One way to do so would be to differentiate our media outlets by various factors, such as their dependence on online engagement versus other revenue streams.
So you could imagine also distinguishing them by when they hired digital content people. Fox News actually went off Twitter for a period of, I think, more than a year. And you could also imagine potentially leveraging changes in the algorithm. So, for instance, Facebook very consciously changed its algorithm in 2018 to try to downweight news stories.
And so I think those kinds of exogenous shocks to the system, well, we could argue about exogeneity, but discrete time shocks to the system could provide some leverage. But you are right. I think that a next step in this could well be to try to differentiate across media outlets.
We partly initially had gone into this expecting that the kind of left right continuum would be an important source of differentiation. And we didn't find that. In fact, we found that, say, like the New York Times in recent years, and this speaks to your point, has been much more likely to use explicit identity cues than has, say, Fox News.
And in other research, attitudinal research, I've been starting to probe kind of the colorblindness, particularly on the American right. And so, yeah, I would say, yes, there are a whole bunch of things going on. I think that you could imagine leveraging things like algorithmic changes, discrete newsroom changes.
So, for instance, if you could identify when do newsrooms hire certain people, or if there are clear changes in digital strategy, at certain points, you could try to distinguish the outlets, but there's not a ton of variation. The outlets seem to be responding in broad strokes in similar ways.
Which one could then either say, okay, that is evidence of your point that they're responding more to general facets of the world. Ideally, the data that you would like is to sit in on meetings where they determine what's the last story we're gonna run, what's the first story we're not gonna run, and how that margin changes over time.
Your second point, about BERT, is a really interesting one. We have not compared this to a dictionary-based approach, though in theory it wouldn't be super hard to do. We did extensive work earlier with SVMs, as I mentioned, and that level of classification performance was not satisfactory enough to make that approach worth it.
And then we were excited. Though in many tasks, I think, RoBERTa outperforms BERT, in this case we found that BERT actually was doing better. But beyond that, I think you're right that in at least some cases, a dictionary-based approach would be really useful.
There's, of course, the problem, particularly with race, and you saw this. You may remember your colleague Dan Jurafsky and a few other folks a few years ago had written about using word embeddings to look at ethnic stereotypes. But they very consciously did not look at African Americans, partly because the word black is a color as well as a racial identifier.
So I'd be curious how it would perform. But it's a good question, and not one I can answer right here. Thanks.
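To make the dictionary-based alternative concrete, here is a minimal sketch. The category names and term lists are illustrative stand-ins, not the paper's dictionaries, and the last example shows the polysemy problem just mentioned: a naive dictionary flags "black" even when it is only a color.

```python
import re

# Illustrative (not the paper's) term lists for two identity categories.
DICTIONARIES = {
    "gender": ["woman", "women", "man", "men", "feminist", "gender"],
    "race_ethnicity": ["black", "white", "latino", "latina", "asian", "racism"],
}

# Precompile word-boundary patterns so "men" does not match "mention".
PATTERNS = {
    cat: re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b", re.I)
    for cat, terms in DICTIONARIES.items()
}

def classify(headline: str) -> set:
    """Return the set of identity categories whose dictionary matches."""
    return {cat for cat, pat in PATTERNS.items() if pat.search(headline)}

print(classify("Why are 193 countries joining a feminist tsunami?"))
# matches "gender" via "feminist"
print(classify("Here are some insults too many black people have heard"))
# matches "race_ethnicity" via "black"
print(classify("She painted the fence black"))
# also matches "race_ethnicity": the color/identity false positive
```

The appeal is transparency and speed; the cost is exactly the ambiguity discussed here, which a contextual classifier like BERT can resolve but a term list cannot.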
>> Justin Grimmer.: Steve, do you have some questions?
>> Steven Davis: Well, two comments. One, and I see Erin Carter has a similar question to my first comment. It would be useful to know whether a shift to more identity-oriented headlines is coincident with, or foreshadows, increases in more identity-oriented news content.
Okay, so you'd have to look at the newspaper articles themselves for that. And two things come to mind. One, and here I have in mind that at least at the national level, you have a bunch of media outlets that are operating in the same news environment. So you wanna look at the differences across media outlets over time in the extent to which they are highlighting identity-oriented content in their Twitter feeds.
How is that feeding into the content of their news articles, either in terms of the overall share of articles that have identity-oriented content, or the extent to which the newspaper articles are featuring identity-oriented themes? So that obviously moves you beyond tweets and Facebook posts and so on, to the newspaper articles themselves.
But I think that's a feasible undertaking, and it seems like an important one because we'd like to know whether this is just about how news content gets marketed. Or is the news content itself actually being altered, you might say distorted, in response to these new technologies and new abilities to identify what consumers want?
The second comment is, it's great, I like doing the text analysis and all, but you leaned heavily on the remarks of Ezra Klein. And I think some of your hypotheses just kinda beg for, why don't you go out and interview 25 people who were in a position like Ezra Klein's, and nobody's in exactly his position.
But who were intimately involved in making these editorial judgments during this time period, and conduct a bunch of structured interviews with them, and come back and ask, what did we learn? Well, instead of trying to infer what they were doing and why they were doing it, let's just ask them directly, is basically the point.
>> Daniel Hopkins: Cool, those are both super helpful. On the first point, we also looked more generally, trying to avoid being overly dependent on what's on social media. We also downloaded, for Fox News and the New York Times, just what's on their webpages. And we looked for uses of terms, and this gets to Justin's point about a dictionary-based approach, just, say, usage of terms like white and black. And to a first approximation, you can see some of the same trends, right? Now, how much is this reflecting the differential selection of stories? I'd be very curious to hear other people's thoughts. We've thought about more event-based studies, where you could imagine, if certain events that the news media has to cover crowd out other events, you can look over time at what are the stories that potentially get crowded out. But here, one of the core things you can see is that, to a first approximation, you do see some of the trends that I've been talking about evident over longer periods of time, not just on social media, but more generally on the webpages of these two exemplary news sources.
Here's also religious identities. Let me then find, I think I have it somewhere, a quote from a researcher, a sociologist who had done almost exactly what you were suggesting, and, I apologize, there it is. The researcher had embedded with both French and American news media outlets, and, yeah, partly we're drawing on that research.
So this is a quotation we like, and we use it to lead into the manuscript: "For example, during a cigarette break in front of the office with two journalists, I asked what kind of articles were usually chosen as headlines for the website. They replied, a headline piece, well, when we have some sex, or Sarkozy, or a touch of racism, it works," right?
And as an encapsulation of what we're arguing, it's pretty good, right? You got gender, you got politics, you got racial and ethnic identity right there. But you are right that the theoretical motivation, in part, yeah, says, hey, let's observe the internal processes. And I do think one direction that could be fruitful would be to observe more of the sociology of how these newsrooms are changing.
We know that they're often cutting a lot of people, but who are the people they're hiring? How are they building up these kinds of digital analytics offices? I think that's a fair point and one potentially productive direction. Steve, did I respond to the different pieces of your question?
Okay, great.
>> Justin Grimmer.: Okay, can we give Erin Carter the chance to speak? She's got a question in the Q and A.
>> Erin Baggott Carter: Great, thank you for this interesting presentation. You have a lot of really fantastic and fascinating stuff going on here. So my first point is very similar to Steve's.
I really am curious, as an extension of this project, to see to what extent this is driven by editors choosing headlines, which they have just huge latitude to do, right? I mean, especially for op-ed writers, if you don't like the title your editor has chosen, too bad, that's what they're going with.
But if there aren't really similar changes in the content of news, that makes me wonder about the real-world impact of this. And actually, that would be my suggestion for an extension, which is to go into sort of a lab context or try to do some observation on Twitter.
If people are just sharing headlines, is there any real change either in their own beliefs, or their own feelings of polarization or in-group/out-group affect, or in those of the people who are seeing the shared content? Do you actually see meaningful political change in the real world, or is this just something that decays really quickly and doesn't have any substantive, meaningful impact?
Similarly, is there any differential impact for people who actually go and read the article versus people who just see the title? So I think with these sorts of studies, the question is often, like, how this impacts actual political polarization in a real-world context. And I think that'd be a really interesting avenue to go down.
Thank you.
>> Daniel Hopkins: Awesome, thank you so much. So I will try, always dangerous, but I will try, as I answer, to bring up a figure that may speak to that. Yeah, I think those are great and really helpful points. So on the second one, one of the things that we've been doing is relatedly analyzing panel data that we happen to have collected from a population-based panel that we've been running here at the University of Pennsylvania from 2008 up through 2020.
We're probably done. But so one of the things that we can do is we can look at, say, a wide range of attitudes and how they have shifted over time, and we can differentiate them by media consumption patterns. And so I'm not trying to shoehorn a whole separate paper presentation into the answer to one question, but I think one thing that we were struck by is that, so here we have white respondents, and this is a population based panel.
It was done on the KnowledgePanel, which I think is most recently owned by Ipsos. But what you can see is that there aren't really differential trends in terms of whites' prejudice, depending on whether they watched the same amount of Fox News in 2012 and 2020, never watched it, increased their Fox News diet, or decreased their Fox News diet.
And this is just Fox News, one outlet among the 19 we've looked at. But that is just to say that I agree that the question of the long-term impact is a really important one to analyze and one that we're, at least to some degree, trying to undertake separately.
And I think that the headline selection point is also a really good one, and, yeah, one that we could then look at in more depth. Partly, in defense of our research design, we were interested in the tweets, the blurbs, the headlines, because on social media, the fraction of people who actually engage with the content beyond that is pretty low.
And so we thought that that was a reasonable first cut. But I think another thing that one could look at is this: there are lists of journalists and their beats. And so you could also look at the shifting composition of beats over time and partly try to tell a story.
You could relate the changing composition of beats over time to news media outlets dependence on, say, web based advertising as a way to get at this in a deeper way. But I think, yeah, the comments are well appreciated and that is one of the ways in which we've tried to get at the question of what are the impacts of this?
And this research would suggest that it's more about the selection into the media venue than it is anything about the causal effect of exposure.
>> Justin Grimmer.: Okay, perhaps we can give John Taylor a chance to ask a question.
>> John Taylor: So, first question, is there a tendency for people to try to hide their reactions more? It seems to me that there's a temptation to do that, and that interferes with the time series quite a bit, if that is happening. And the second question is whether there's a tendency to search for less opinionated things. I'm just wondering, for example, about public TV, public radio, KQED. I'm noticing a little bit in our own house that that's happening.
>> John Taylor: Is that a general phenomenon that you have to watch out for? Those are two questions. Thank you. Thank you very much. Great information, by the way.
>> Daniel Hopkins: Yeah, no, thank you. So on the question of are people hiding reactions more? That's really interesting, and I can speak to it to some degree, so I'll just talk about the results.
So, with Facebook, the Facebook data are really useful because we can see a range of activities, including whether people clicked on something. And what's interesting about clicking is, if Justin clicks on a link, no one knows except the social media firm and wherever that link goes, right?
But his friends don't know. That's a less public behavior than if Justin retweets or likes something, right? And one of the things that we were finding is that identity content on Facebook, some kinds of identity content for some subgroups, generates more shares, so people saying, hey, you gotta read this article, but fewer clicks.
So, you gotta read this article, but I don't need to. Which is interesting, and I think instructive. It speaks to your point, John, in that I think what it suggests is that there is a difference in how people react publicly: sometimes people wanna associate themselves with this content, tell everybody else in their group, like, hey, I'm a good group member, or, I saw this, I want to play my role and perform that identity, pass this along, versus, I don't need to click on it.
And as I was scrolling through some of the results, you might have seen that we had also done a number of experiments where we ourselves put news stories on Facebook and varied how we billed them. And one of the things that we were finding is that when we used very heavily identity-laden kinds of treatments, so, like, we'd have a Black Lives Matter image associated with a news story, that often generated fewer clicks.
And I think the thing that explains it, that we were not holding constant, is the novelty question. When people saw that Black Lives Matter sign in the fall of 2021, they knew exactly, or at least had a strong suspicion about, what that news story was going to be.
And so they can share it as a way of affirming a certain kind of identity, or advancing a certain kind of identity, without actually needing to read the story. Then, your second point, about shifts in media consumption, maybe people being less interested in opinionated content: it is undeniably the case that partisan groups that are threatened by the direction of current politics seem to engage in more media consumption.
So Daily Kos, which I think is based out in the Bay Area, recently announced significant layoffs to their media wing. And they detailed just how far page views have fallen since the inauguration of Joe Biden. And on the flip side, Fox News viewership is up. We saw some of the reverse patterns during the Trump era.
And so I certainly know that there is extensive research on which groups of partisans are more engaged. I'm lucky to be tagging along on an NSF-funded project here at Penn, where we're looking at the parameters of search engines that produce more sustained engagement from Democrats, Republicans, and people who don't identify with either side.
It's unfortunately very, very early days in that research project, but I'd be excited to report back in some number of months or years as to what we're learning about the kinds of search engines that are more or less satisfying for people across the political spectrum.
>> Justin Grimmer.: Okay, so we're running up on time here, so we're gonna go ahead and stop the recording. Thanks to Dan for an incredible presentation and Q and A, but please stick around, because we can engage in a more informal discussion and get into a little bit more detail on the analysis. So thanks again, Dan, and see you all next time.
Daniel J. Hopkins is a professor of political science at the University of Pennsylvania and has also taught at Georgetown and Yale Universities. Professor Hopkins’s research focuses on American political behavior, with a special emphasis on elections, ethnic and racial politics, state and local politics, and quantitative research methods. He is the author of two books, The Increasingly United States: How and Why American Political Behavior Nationalized (University of Chicago Press, 2018) and Stable Condition: Elites' Limited Influence on Health Care Attitudes (Russell Sage Foundation, 2023), as well as more than 50 scholarly articles. His work has appeared in FiveThirtyEight.com, the New Republic, and the Washington Post. He has also served in the US federal government and the City of New York. He received a PhD in government from Harvard University.
Steven J. Davis is senior fellow at the Hoover Institution and professor of economics at the University of Chicago Booth School of Business. He studies business dynamics, labor markets, and public policy. He advises the U.S. Congressional Budget Office and the Federal Reserve Bank of Atlanta, co-organizes the Asian Monetary Policy Forum and is co-creator of the Economic Policy Uncertainty Indices, the Survey of Business Uncertainty, and the Survey of Working Arrangements and Attitudes.
Justin Grimmer is a senior fellow at the Hoover Institution and a professor in the Department of Political Science at Stanford University. His current research focuses on American political institutions, elections, and developing new machine-learning methods for the study of politics.