Two and a half years ago I subscribed to Bill Bishop’s Sinocism Newsletter to aggregate and streamline my news consumption on China and Sino-US relations. I don’t regret it. Through Bill’s newsletter, I came upon another newsletter, ChinaAI, which is the work of Jeffrey Ding from the Future of Humanity Institute at Oxford. Both are great supplements, but they have recently come to be at odds with each other.
In response to the increasingly conflictual trend in US-China relations, Bill’s editorializing has taken on a harder, more cynical (the newsletter is not called “Sinocism” for nothing) edge while Jeff has been a voice for moderation. In particular, Jeff has called on the China watching community to do a better job lifting up diverse voices into gatekeeping positions. In that issue, Jeff also made a general call asking someone to perform a text analysis of Sinocism to explore trends in general sentiment over time, different memes (New Cold War, AI, tech war, etc.), content, and so on.
This post is a response to that call. As a longtime reader of Sinocism, I felt I was in a good position to combine the surface-level insights from a text analysis with my personal “domain” knowledge. Also, it was a good opportunity to get my hands dirty with some different NLP libraries. Below I provide a quick overview of my methods and discuss the results. In future posts, I will dig into the Python side of things.
To get the emails, I requested them from Google’s Takeout service. For the uninitiated, it’s a great way to archive data from your Google account(s). Google bundles the requested data and sends it to you in an email. Upon receipt, I converted them from .mbox format to HTML and began parsing away. I broke the text up into three categories:
- Introductory Remarks
- In-text Commentary
- Entire Newsletter
Introductory remarks refers to Bill’s headline commentary. In-text commentary refers to his commentary on specific articles. Entire Newsletter is all of the paragraphs including summaries of the news articles and his commentary.
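As a rough illustration of the extraction step (not the exact code behind this post), here is a minimal sketch that pulls the HTML body out of each message in a Takeout .mbox file using Python’s standard mailbox module. The function names html_parts and mbox_to_html are my own placeholders, and I am assuming each newsletter issue arrives as one message with a text/html part.

```python
import mailbox
from email.message import Message


def html_parts(msg: Message) -> list[str]:
    """Return the decoded text/html parts of one email message."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)  # bytes after transfer decoding
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8 (an assumption).
            parts.append(payload.decode("utf-8", errors="replace"))
    return parts


def mbox_to_html(path: str) -> list[tuple[str, str]]:
    """Return (subject, html) pairs for every message that has an HTML body."""
    issues = []
    for msg in mailbox.mbox(path):
        html = "\n".join(html_parts(msg))
        if html:
            issues.append((msg.get("Subject", ""), html))
    return issues
```

From there, each HTML string can be handed to whatever parser you prefer to split it into the three categories above.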
I chose this breakdown because Bill has altered the newsletter’s formatting multiple times since its inception. Certain aspects are fixtures, like the ‘Essential Eight’, a set of 8 articles he ranks as most important for the day, and the content headers. Others have changed a fair amount.
Of those features that have changed, the most relevant for an NLP analysis is his technique for commenting on the news articles. At first, he explicitly set off his commentary with // Comment: or Comment: and most of this was in the body of the newsletter. Over time he has slowly shifted his commentary to the head of the newsletter and away from the body.
Now when he makes in-text comments, they are rarely explicitly set off apart from the use of bold typeface. The result? It is less easily traceable with regex or HTML tags. While this may be more aesthetically streamlined, it makes my job harder. I did my best to isolate the in-text commentary, but there is indeed a fair amount of spillover from the article summaries, especially in more recent issues.
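For the earlier issues, where comments were explicitly marked, a pattern along these lines could isolate them. This is a minimal sketch under the assumption that each paragraph has already been extracted as a plain string; COMMENT_RE and extract_comments are illustrative names, and the bold-only comments in later issues would need a different approach.

```python
import re

# An optional "//" followed by "Comment:" at the very start of a paragraph.
# DOTALL lets the captured text run across line breaks inside the paragraph.
COMMENT_RE = re.compile(r"^\s*(?://\s*)?Comment:\s*(?P<text>.+)", re.DOTALL)


def extract_comments(paragraphs):
    """Return the commentary text of paragraphs explicitly marked as comments."""
    comments = []
    for para in paragraphs:
        match = COMMENT_RE.match(para)
        if match:
            comments.append(match.group("text").strip())
    return comments
```

Because the pattern is anchored to the start of the paragraph, a summary that merely contains the word “Comment:” somewhere in the middle is left alone.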
In my unscientific estimate, in-text commentary is about 70% Bill’s thoughts and 30% news content. The introductory remarks are exclusively Bill’s opinions. The entire newsletter content is a decent proxy for trends in China as well as Sino-US relations, a thermometer so to speak, but should not be taken as unbiased. Much of Bill’s editorializing is via selection. What articles and topics he chooses to include, how much attention he gives them, and where they are placed are all subtle yet important forms of editorializing.
A quick and dirty bag-of-words text analysis as done here won’t pick up on these subtleties, but it does provide food for thought.
The analysis proceeds from general to specific. I start by exploring the commentary length over time. Then I dig into the most common unigrams and bigrams. Afterwards, I briefly look at two memes that are increasingly prevalent in news about China: artificial intelligence and New Cold War. I conclude with a temporal sentiment analysis using the Lexicoder Sentiment Dictionary (Young and Soroka 2012), a canned dictionary that comes with quanteda.
Top Words and Phrases
Next, I wanted to get a sense of whether and how closely the content of each type of commentary aligns with the entire text. I did this in two ways. The first way is by comparing a simple count of the top 50 terms by frequency faceted by the three categories. The results are below.
Unsurprisingly, terms like China, US, Trump, Xi, Beijing, PRC dominate the top 5 slots across all three categories. A quick scan of the in-text commentary suggests that it has mostly focused on insider knowledge/rumours on the trade talks and the Party Congress/other party meetings. The introductory commentary is almost exclusively dominated by the trade war, but only because this has so heavily colored bilateral relations.
Only one Chinese character, 的 (de), a possessive particle, makes it on the list. If I had utilized a Mandarin dictionary, it probably would have been removed as a stopword. On a side note, an interesting future analysis would be to focus explicitly on which Chinese articles Bill includes, their content, and the top news outlets he taps apart from the People’s Daily.
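The term counts discussed above boil down to a bag-of-words tally. Here is a minimal sketch of that idea, not my actual pipeline: the stopword list is a tiny illustrative stand-in, the tokenizer is a simple regex I chose for the example, and top_terms is a hypothetical helper name. Adding 的 to the stopwords shows how a Mandarin stopword list would have kept it off the chart.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "的"}

# Runs of Latin letters or digits form one token; each CJK character is its own token.
TOKEN_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]")


def top_terms(texts, n=50):
    """Return the n most frequent non-stopword tokens across the given texts."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)
```

Calling top_terms once per category (introductory remarks, in-text commentary, entire newsletter) gives the three lists that the faceted comparison is built from.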
While the single tokens might make for a nice plot, they are not that informative. So the second way I explore the congruency between commentary and content is to use spaCy’s pretrained part-of-speech (POS) tagger and extract the top 50 noun phrases, dropping collateral damage caused by the Mandarin. The figure below presents the findings.
The top 5-10 phrases closely mirror the results from the unigram approach, but with slightly more proper names and places making the cut, e.g. Wang Qishan, Guo Wengui, Liu He, etc. Moving down the list, things become more interesting. Even though overall counts tend to drop off, we can start to see a divergence in what issues Bill pays attention to vis-a-vis the overall content.
Looking at the entire newsletter, we see the phrases are well-distributed across multiple topics. For instance, local_government is probably a result of the PRC’s focus on addressing local government debt issues. This topic was especially pronounced throughout 2017, but has since taken a backseat to the trade war. Lower on the list are phrases like national_security, a political staple; artificial_intelligence, which is increasingly in the spotlight; foreign_investment, a sticking point in the trade talks; peking_university, which has undergone a significant change in leadership and been embroiled in a few scandals; new_zealand, an epicenter for news on China’s United Front influence operations; discipline_inspection, in relation to Xi’s anti-corruption campaign; and environmental_protection, a nagging problem in China.
Moving to the in-text commentary, we see Bill’s focus centers more on high-level Chinese party officials and other notable figures, the Trump administration and its members, and “insider knowledge,” so to speak. These comments aim to provide deeper context and Bill’s personal read on the private dealings of both governments. This is not evident from the phrase counts alone, but my personal experience tells me so. Phrases such as ‘another_reminder’ and ‘another_sign’ usually precede clauses like ‘of Xi’s increasing consolidation of power’ and ‘that neither side is likely to make any concessions soon.’ The tailored nature of this commentary is intuitive given that it is often in response to one or more specific news articles.
Finishing with the introductory remarks, three topics dominate apart from administrivia: the trade war, US-DPRK nuclear negotiations, and Sino-US technology competition. The first and third topics could be nested under a broader topic of Sino-US competition. Bill and many in the Beltway have framed these two issues as part and parcel of a ‘New Cold War’ (elsewhere, I dubbed this meme ‘Cold War 2’ or CW2 for short). However, I think they are worthy of a distinction. Bill usually conveys his personal opinion on trends in each of these areas and occasionally offers predictions or things to look for on the horizon.
News articles often bundle coverage of tech advances (especially in areas like machine and deep learning, cloud computing, UAVs, and the hardware necessary to power these algorithms) under the umbrella of artificial intelligence. Typical coverage of the tech industry in China is no exception. While I believe this oversimplification gives the misleading impression of AI as a monolith and further obfuscates important distinctions, it is the norm among non-specialist journalists. Thus, despite my personal convictions, I employ just the term “artificial intelligence” to search for instances of the AI meme in the newsletter. CW2 allusions tend to be slightly more nuanced, sometimes not referencing the term “Cold War” at all. I therefore represent this meme with the phrases “cold war,” “containment,” “tech war,” and “arms race.” I then used glob wildcard matching to trace the presence of these memes in each issue of the newsletter.
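To make the glob matching concrete, here is a minimal sketch using Python’s standard fnmatch module. The phrase lists come straight from the paragraph above; everything else (the MEMES mapping, the helper name count_meme_paragraphs, and the choice to count matching paragraphs rather than individual occurrences) is an assumption for illustration.

```python
from fnmatch import fnmatchcase

# Glob patterns per meme; the surrounding "*" lets a phrase match anywhere in the text.
MEMES = {
    "AI": ["*artificial intelligence*"],
    "CW2": ["*cold war*", "*containment*", "*tech war*", "*arms race*"],
}


def count_meme_paragraphs(paragraphs, patterns):
    """Count paragraphs that match at least one glob pattern, ignoring case."""
    total = 0
    for para in paragraphs:
        # Lowercase and collapse whitespace so "Cold\nWar" still matches "cold war".
        normalized = " ".join(para.lower().split())
        if any(fnmatchcase(normalized, pat) for pat in patterns):
            total += 1
    return total
```

Running this per issue and per category yields a count series that can be plotted over time. Note that a pattern like *containment* will also catch unrelated uses of the word, which is one way an under-defined term set can add noise.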
Juxtaposing the two types of commentary with the entire newsletter, one immediate result stands out. Although I have felt Bill increasingly employs Cold War-esque frames in his commentary, the data does not support this sentiment. The last time the topic appears is in September 2018. One conjecture is that the CW2 meme has become more prevalent in the news articles Bill chooses to feature but not his commentary. Another, more plausible, conjecture is that my hasty parsing of the text coupled with an under-defined set of terms caused me to miss more recent appearances of the meme in his liner notes.
The second result is the rapid growth in the CW2 meme within the general media since the end of 2017. While AI has been a fixture in coverage of China, CW2 is a relatively newer beast. I consider Campbell and Ratner’s “The China Reckoning” piece in Foreign Affairs to be one of the defining events that touched off the so-called “China Consensus” in DC. The plot confers some descriptive legitimacy to my hypothesis. In the two months prior to this piece CW2 appears 10 times in the newsletter, and 17 times in the two months after.
I am not implying a causal relationship here. Their piece could easily be considered a symptom of the shift rather than a harbinger. I merely believe this publication is a useful heuristic to mark the structural shift in Sino-US relations.
Ideally, one should tailor a sentiment dictionary to the corpus under study, but I haven’t the time for that. As an alternative, I use quanteda’s canned sentiment dictionary. While I am generally skeptical of the validity of a sentiment analysis, especially when using a canned dictionary, it can provide some exploratory utility. Below I plot the positive-negative valence of the newsletter over time, computed as the difference between positive and negative token counts, normalized by the total number of tokens in each category per issue. I again demarcate the Campbell and Ratner piece’s publication date.
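For readers who want to see the arithmetic, here is a minimal Python sketch of a normalized valence score. The word lists are toy examples that I made up for illustration and are not the Lexicoder entries. The sketch also ignores features such as wildcard entries and negation handling that a real dictionary may include, and the function name valence is my own.

```python
import re

# Toy word lists for illustration only; these are NOT the Lexicoder entries.
POSITIVE = {"progress", "cooperation", "agreement"}
NEGATIVE = {"war", "crisis", "threat", "tension"}


def valence(text):
    """Return (positive - negative) / total tokens, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)
```

A score below zero means negative tokens outnumber positive ones, so an issue at -0.05 has a net 5 percent negative lean relative to its length.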
Based on this dictionary, the entire newsletter usually sits in the 2-5 percent negative valence range, with occasional positive spikes. The in-text commentary is even more negative, with some issues containing over 20% negative tokens. However, the in-text commentary does appear to have more positive content than either the entire newsletter or the introductory remarks. I explore this more below. The intro remarks display a similar negative lean.
Does this tell us anything we don’t already know about Bill’s editorializing? Not really. Bill named his newsletter Sinocism as a play on cynicism for a reason. That alone should have sufficiently conditioned our expectations.
What about the entire newsletter, which I believe is a useful barometer for bilateral relations? Again, the sentiment analysis turns up few surprises. As an interlocutor in the DoD aptly stated the other day, “… we went from cooperating for cooperation’s sake to where we are now, competing for competition’s sake.” I’d be surprised if the relative valences looked any different.
The one finding that stands out is how much the in-text valence vacillates in recent months and, in particular, the sheer number of positive issues. To look into this further, I randomly sample four issues from after December 2018, two with negative valences and two positive. Let’s take a look.
These samples lend further credence to my skepticism of canned dictionaries. The sentences from the relatively positive samples do not evince positivity. If anything, there seems to be a fair amount of cynicism throughout the May 6, 2019 newsletter. What’s more, Chinese propaganda is peppered about the June 19, 2019 newsletter which is probably driving the relative positive valence. A sample size of 2 is not nearly large enough to be representative, but given my hasty data cleaning, I’m confident the sentiment analysis is mainly noise.
The text analysis did not uncover anything counter-intuitive to longtime readers of Sinocism. For those who do not watch China closely, the results provide a good window into Sino-US relations over the past couple of years and flesh out Bill’s editorial priorities.
The most intriguing finding, in my view, is that mentions of the CW2 and AI memes are extremely scant in the commentary, despite their consistent prevalence in the news. This does not conform to my intuition gained by reading the newsletter daily. Either my perception of the newsletter is biased or the bulk of Bill’s editorializing is structural, as opposed to verbal, in nature. The reality is likely some combination of both these factors.
It is also interesting to note that the AI meme predates CW2 and only more recently has the latter started to engulf the former. These two memes are already exceptionally reductionist and need not be further merged into a “New AI-fueled, Tech Cold War” meme.
In coming weeks, I will walk through my code for this post step-by-step, breaking it into multiple installments. If anyone has specific requests for some type of NLP or otherwise they would like to see done on this data, let me know. I’ll do my best to work it in.
I normally adhere to an open data policy, but the newsletter is behind a paywall. I don’t think Bill would be happy about me sharing 2.5 years of archived issues. If anyone is interested in doing their own analysis, contact me and we’ll see if we can work out a data format that protects Bill’s intellectual property.
Thanks for reading.