Networks of Computational Social Science
Ian Dennis Miller
University of Toronto
My name is Ian Dennis Miller. In this presentation, I discuss the computational social science literature, which I have analyzed as a network of collaborations.
I start with my objectives and then describe the structure of this talk.
In the current work, I examine the scholarly literature in order to determine the context of my own research, with respect to the research that others are doing. As a result of this literature review, I have identified a scientific network that spans many different literatures but nevertheless connects to form a coherent whole.
In this talk, I present the methods I used to build a library of scholarly citations. Once I collected the relevant literature, I performed a network analysis of my citation library. I analyzed the network formed by co-authorship, in which collaboration forms the links of the network. The result of that work identified communities of collaborators. When these communities were visualized, I was able to locate the relevant academic conversation.
This talk is structured like an academic article. We begin with the introduction, motivations for doing this work, literature review, and background. Over the remainder of the talk, I’ll describe the methods, results, discussion, and finish with conclusions.
My primary motivation for the current work was to locate myself within the academic literature. The general direction of my graduate research involves the computational modeling of humans sharing memes. In the years since I started grad school in 2011, new conferences have emerged and several communities have attempted to form disciplines that are adjacent to my research interests.
As I conducted my research, I kept finding relevant methods in distant literatures. In the absence of clearly-delineated disciplines, I wanted to organize my reference library to find the relationship between my own work and these other literatures.
In the current work, I apply many of the methods I’ve learned from my studies of network sciences in order to synthesize the literature. At a strategic level, I was motivated to do this work in order to build bridges that connect these disparate literatures. In order to build the bridge, I would need a map - but there was no map. Thus, I started by making the map.
I became aware of the need for this synthesizing work as I was studying for my PhD qualifying exams. Due to my interest in memes and viral phenomena, I chose three topics to summarize for my oral examination: contagion, social network analysis, and memes. For each topic, I selected 30 articles to read, for a total of 90 articles. Even though these topics are related, both theoretically and thematically, there was less overlap between these literatures than I would have expected.
I was compelled to find a way to directly tie these topics together. Over the course of learning about social network analysis, I was introduced to new methods for exploring network data - and it struck me that these methods could help. Thus, I started this work by applying network methods to the general challenge of finding relationships among the 90 articles I read for my qualifying exams.
Let’s begin with some background.
Small World Problem
In 1969, Travers and Milgram published a fascinating study that sought to characterize our social connectedness. This article pitted two theories of human connection against one another. One theory held that people exist in separate social networks that never overlap; in this model, people move in entirely separate orbits. The other theory held that all people are connected to one another - even if some connections are more distant than others.
Travers and Milgram used a methodology in which they mailed letters to volunteers in Nebraska, asking the recipients to move those letters toward a target person in Boston. However, there was one special instruction: at each step, the letter could only be passed to somebody the current holder knew on a “first-name basis.”
Travers and Milgram recovered many of the letters, each of which contained a record of the chain of contacts who forwarded it. Travers and Milgram then counted the number of hops each letter required and plotted that distribution, which is presented in the image on this slide. The peak in the distribution occurs at 6 links, meaning the most common social chain connecting a starting participant to the target was 6 links long. This is where the famous term, “6 degrees of separation,” originated. From this result, Travers and Milgram ultimately concluded that everybody is connected; our social circles overlap.
Strength of Weak Ties
The connection between two individuals can be characterized in several ways, apart from path length. Mark Granovetter looked at the strength of the connections between individuals, which he categorized as being either strong or weak. When two individuals have many shared connections - that is, when their social networks have a higher degree of overlap - then those two individuals have a strong tie between them. A weak tie occurs when two individuals are connected but few people from their respective social networks know anybody in the other network. Granovetter wondered whether stronger ties lead to greater influence: a higher percentage of social network overlap might imply greater social influence between those individuals.
Granovetter observed that separate clusters of strong ties can be connected over long distances with a small number of weak ties. These conditions result in a network topology with unexpectedly short path lengths connecting any two people in the network, just like Travers and Milgram observed.
While Granovetter’s work was originally sociological in nature, it applies to scholarship as well. Library search tools make it easy to go one hop in the network, so Granovetter’s strong ties are fairly easy to identify and navigate. Once you have located an article, online scholarship portals streamline the process of linking to cited articles and other articles by the authors.
Typical search ranking algorithms favor strong ties, placing well-cited articles higher and suggesting co-authors who frequently collaborate. For example, PageRank, a famous article ranking algorithm, is heavily influenced by the number of citations an article has. Although this kind of ranking usually produces good results, it also down-ranks on the basis of weak ties - and in some cases, those might be the links you are looking for.
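To make that ranking intuition concrete, here is a minimal power-iteration PageRank over a hypothetical citation graph. This is a sketch in Python, not Google's implementation; the article names and parameters are illustrative:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank. `links` maps each article to the
    list of articles it cites."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, cited in links.items():
            for dst in cited:
                # Each article passes a share of its rank along its citations.
                new[dst] += damping * rank[src] / len(cited)
        rank = new
    return rank

# Hypothetical citation graph: A and B cite C; C cites D.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["D"]})
print(ranks["C"] > ranks["A"])  # True: the well-cited article ranks higher
```

Because C receives citations from two articles while A receives none, C's score rises above A's - which is exactly why weakly-cited articles sink in the rankings.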
Small World Networks
The “small world” discovered by Travers and Milgram was abstracted and formalized as a short algorithm published in 1998 by Duncan Watts and Steven Strogatz. The algorithm can be stated in a single sentence: “starting from a ring lattice with n vertices and k edges per vertex, rewire edges at random with probability p.”
I have broken this sentence into three fragments; let’s talk about them one at a time. It begins: “starting from a ring lattice.” The image in this slide depicts 12 vertices connected into a ring, which forms a basic lattice.
The ring lattice has “n vertices and k edges per vertex.” In the image on the slide, we have n=12 vertices and 2 edges per vertex. Each vertex has one link going to the left and one to the right. There can be more than 2 edges per vertex but the lattice depicted on this slide is about as simple as it gets.
Finally: “rewire the edges at random with probability p.” To “rewire” means to disconnect one end of an edge and reconnect it to another vertex chosen at random. The rewiring probability p can be varied from 0 to 1. When p is 0, you end up with an unmodified lattice because nothing ever changes. When p is 1, every edge is rewired and you create a completely random network. What’s nice about the Watts and Strogatz algorithm is that you can vary p to obtain small world networks that exist somewhere between order and randomness. For network scientists, this algorithm provides a reliable method to produce small world networks for research.
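The whole procedure fits in a few lines of Python. This is a minimal illustration of the rewiring idea, not the authors' reference implementation:

```python
import random

def watts_strogatz(n, k, p, seed=42):
    """Ring lattice with n vertices and k edges per vertex; each edge is
    then rewired at random with probability p.  A minimal sketch."""
    rng = random.Random(seed)
    # Ring lattice: connect each vertex to k/2 neighbors on each side.
    edges = []
    for v in range(n):
        for offset in range(1, k // 2 + 1):
            edges.append((v, (v + offset) % n))
    # Rewiring: with probability p, reconnect the far end of an edge
    # to a vertex chosen at random (avoiding self-loops).
    rewired = []
    for (u, v) in edges:
        if rng.random() < p:
            w = rng.randrange(n)
            while w == u:
                w = rng.randrange(n)
            rewired.append((u, w))
        else:
            rewired.append((u, v))
    return rewired

lattice = watts_strogatz(12, 2, 0.0)     # p=0: the unmodified ring lattice
random_net = watts_strogatz(12, 2, 1.0)  # p=1: essentially a random network
print(len(lattice))  # 12 edges, one per vertex around the ring
```

Varying p between the two extremes produces the small world regime described above.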
Small world topologies facilitate epidemic spread. For contact-based contagious processes, strong ties provide multiple opportunities for exposure. Among a set of strong ties, if one contagious connection doesn’t infect you, another one might. Weak ties are the key to transporting infections across long distances to new clusters of strongly-connected neighborhoods. Weak ties can function like shortcuts in the network.
Reconsidering 6 degrees of separation in the context of weak ties caused me to wonder about the implications for scholarship, collaboration, and co-authorship. Co-authorship is an example of a strong tie that usually occurs between people who know each other very well. What then for the relatively rare connections that permit information and influence to travel long distances? What about the weak ties?
Although small world networks describe many social phenomena, including co-authorship, these networks do not describe all social phenomena. Scale-free networks, first described by Barabasi and Albert in 1999, are a different network topology that models the characteristics of other social phenomena, including scholarly citation. Scale-free networks also describe the dynamic linking structure of the World Wide Web and celebrity status in online social networks.
Scale-free networks can be modeled with an algorithm called “preferential attachment” that produces networks with a “rich get richer” dynamic. Let’s consider celebrity Twitter users for a moment. The most popular celebrities are most likely to receive new connections - that is, they will become even more popular as the network grows. The preferential attachment algorithm simply states: when building the network, connect new nodes to the network with a probability proportional to the number of edges a given node has. Those nodes with the most edges are the most likely to receive new connections.
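Preferential attachment can be sketched by growing the network one node at a time. This is a minimal illustration; the seed clique and parameters are my own assumptions, not a specific published implementation:

```python
import random

def preferential_attachment(n, m, seed=7):
    """Grow a network to n nodes; each new node attaches m edges to
    existing nodes with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = []
    targets = []  # each node appears once per edge endpoint it holds
    # Seed the network with a small clique of m+1 fully-connected nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            targets += [u, v]
    for new in range(m + 1, n):
        # Uniform sampling from `targets` is proportional to degree:
        # high-degree nodes appear more often, so the rich get richer.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets += [new, t]
    return edges

net = preferential_attachment(100, 2)
degree = {}
for u, v in net:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(max(degree.values()))  # a few hubs hold far more edges than average
```

The `targets` list is the whole trick: a node with ten edges appears ten times, so it is ten times as likely to attract the next connection.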
Barabasi and Albert discovered there are fractal aspects to scale-free networks. Zooming in or zooming out results in a similar-looking network, irrespective of scale. In addition to describing online social networks like Twitter, scale-free topologies also generalize to the human genetic network and the distribution of mass across the universe.
Academic citations follow a scale-free topology because the most-cited articles are more likely to receive new citations. As well, citation data structures tend to be really big: when there are thousands of articles and each one cites dozens of other articles, the number of citations rapidly increases.
Scientific Collaboration Networks
In 2001, Mark Newman looked at patterns of co-authorship in several disciplines. Disciplines that were earlier to adopt bibliometric practices, including biology, physics, and computer science, were among the easiest to study. These communities provided open access to articles and facilitated article pre-print practices. For example, Cornell University’s arXiv provides an article repository as well as indexing and search. The bibliometric work conducted by these disciplines created a data structure that is suitable for network analysis. Newman mined those bibliographic data to confirm that academic co-authorship yields small world networks. The different disciplines yielded different networks - but they were all small worlds.
This leads us to the following paradox: when all authors are somehow connected in a small world, why do we observe disciplines or “silos” in academia? Of course, Newman studied these disciplines in isolation from one another, so perhaps each one is a separate small world - but anecdotally, I think that is unlikely. I suspect silos result from different factors that have to do with network distance. To the extent that two disciplines are distinct, they may use different words to describe the same thing. Without shared vocabulary, it will be difficult to harmonize keywords or even perform scholarly searches.
My suspicion is that silos are the effect of network clustering, which causes longer average path lengths. A byproduct of longer paths is that, as ideas require more hops, more adaptation is required to pass them along the network. Increased adaptation causes stronger vocabulary effects - jargon - leading to the same concept having multiple names in different disciplines. I will revisit this idea several times throughout this talk.
Scholarly Communication and Bibliometrics
By the turn of the millennium, bibliometrics was a fairly well-established discipline. At that point, Borgman and Furner reviewed the bibliometric literature to date from an information sciences perspective. Their review provides a taxonomy for characterizing bibliometrics in several ways.
They identified numerous scholarly activities that produce bibliometric records, including writing, citing, submitting articles, and collaboration. Bibliometrics can be aggregated at different resolutions: the person, the group, the discipline, the institution, or even the nation. Bibliometrics can also be analyzed in terms of the kind of publication: whether it is a research article, a review, or a reference work. Each of these kinds of publication has different bibliometric properties, both in terms of how the publication is constructed and the way it is used by the rest of the literature.
Co-authorship: Structural/Socio-academic Groups
Publication and citation do not occur in a vacuum; there is a socio-academic structure supporting the work. Rodriguez and Pepe asked whether institutional communities drive collaboration or if, instead, collaboration is driven by research interests. To answer this question, Rodriguez and Pepe compiled a database of co-authorship data that they augmented with author information, including department and institutional affiliation.
Based on their community detection analysis, Rodriguez and Pepe concluded it wasn’t so much the nature of the research that determined who collaborates with whom. Instead, co-authorship was driven by the department the people were located in and the institution they were affiliated with. This is consistent with the intuition underlying Granovetter’s strong ties: the higher degree of social overlap, the greater the influence.
In the current work, I focus on co-authorship in order to accomplish several goals. First, I wanted to find out who was talking about the topics I was interested in. Additionally, I wanted to uncover the network structure of the beliefs underlying my research interests. I make an assumption about shared beliefs: whenever co-authorship occurs, it is because the authors believe in what they are publishing. Thus, to identify the network of beliefs, I start by identifying the people who have collaborated.
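Under that assumption, the co-authorship network falls directly out of a citation library: each article contributes an edge between every pair of its authors. A minimal Python sketch, using hypothetical author lists:

```python
from itertools import combinations

def coauthorship_edges(articles):
    """Each article links every pair of its authors; repeated
    collaborations increase the edge weight."""
    weights = {}
    for authors in articles:
        for pair in combinations(sorted(authors), 2):
            weights[pair] = weights.get(pair, 0) + 1
    return weights

# Hypothetical author lists drawn from three articles.
library = [
    ["Watts", "Strogatz"],
    ["Barabasi", "Albert"],
    ["Watts", "Newman", "Strogatz"],
]
edges = coauthorship_edges(library)
print(edges[("Strogatz", "Watts")])  # 2: this pair collaborated twice
```

Edge weights like these are what community detection algorithms consume to find clusters of collaborators.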
For the reasons I’ve discussed, a co-authorship analysis will be influenced by institutional affiliation in addition to research interests. I think of these as two possible kinds of collaboration: convenient collaborations and interest-based collaborations - and between these two, there are more convenient collaborations. I didn’t want to miss out on those rarer, weak-tie, interest-based collaborations that might indicate research overlap.
Nevertheless, the institutional network can be beneficial to this co-authorship analysis. Due to the overlapped social networks of strong academic ties, I can use biographical information to augment my search. I don’t necessarily need to be restricted to a single database of scholarly publication: I can potentially locate co-authorships through several avenues, including institutional affiliation and mentorship.
The biggest reason to focus on co-authorship is that it is the “right-sized” problem. I wanted to undertake a project I could finish in a reasonable amount of time that would provide some actionable insights. This study is based on a curated citation library consisting of around 2500 articles. Although that might sound like a relatively small bibliometric dataset, every entry was hand-curated. Perhaps most important of all, this quantity of articles could be collected over the course of several months and was sufficient to produce useful results.
Alternatives to co-authorship
There are many bibliometric approaches to the challenge of mapping a discipline, many of which I have already alluded to. I will briefly discuss why I did not use each of these bibliometric alternatives to co-authorship.
Citation data are one of the more common objects of bibliometric analysis - largely because the data are fairly easy to acquire using information technology. Off-the-shelf software and existing databases are well-suited to citation analysis. The best way to leverage citation data is to collect as many citations as possible - but doing so would violate my objective of keeping this project small.
Co-citation is an interesting corollary to citation. By analogy, it is like a strong tie of citedness. When two articles - A and B - cite the same third article C, we say that A and B have a co-citation relationship. Co-citing articles have a higher probability of being related than articles that are not co-cited. Co-citation requires citations to be linked to the article that originates the citation, which is an informational complication. Thus, co-citation suffers from the same data scale issues as citations: a technical solution would be required.
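For illustration, co-citation counts can be computed from reference lists in a few lines (a sketch with hypothetical articles; the hard part in practice is acquiring the linked reference lists in the first place):

```python
from itertools import combinations

def cocitation_counts(citing):
    """Articles A and B are co-cited whenever a third article cites both.
    `citing` maps each article to its reference list."""
    counts = {}
    for refs in citing.values():
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

# Hypothetical reference lists: X and Y both cite A and B together.
pairs = cocitation_counts({"X": ["A", "B"], "Y": ["A", "B", "C"], "Z": ["C"]})
print(pairs[("A", "B")])  # 2: A and B were co-cited by two articles
```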
Articles include “acknowledgements,” which appear as loosely-structured paragraphs of natural language either at the beginning or the end of an article. As such, it is difficult to acquire acknowledgements data in the first place and, once obtained, the paragraphs are difficult to parse. However, acknowledgements imply a stronger tie than citations because acknowledgements are typically restricted to people that the authors knew. Therefore, acknowledgement data are rich and could be theoretically expected to indicate shared beliefs - but raw acknowledgements data are difficult to work with.
It’s also possible to analyze the institutions and mentorship lineage of academics. In most cases, these data are unstructured and therefore very time consuming to work with. However, this type of work is not without precedent: there are several projects that trace the lineages of mathematicians, for example. Mentorship ties tend to be very strong - with lots of network overlap. Certainly, as it relates to my original claim regarding the identification of beliefs, mentorship networks strongly transmit beliefs. However, as with acknowledgements, the work is complicated because the data are unstructured.
In summary, some bibliometric methods require big data and other methods require significant data cleaning. None of these alternatives to co-authorship provide the right ratio of data size and belief indication - so, by process of elimination, co-authorship is the best for my purposes.
I claim that co-authorship indicates a shared belief and these beliefs accumulate into a discipline consisting primarily of strong ties; ergo, clusters of co-authors are disciplines. Such disciplines are called by many names, including: colleges, schools, arms, branches, and lines of reasoning. From a network perspective, these words are synonyms, with the same connotations about the underlying network structure of scientific knowledge.
I suspect that weak ties connect these various clusters - but these weak ties can be tricky to identify using current literature search methods. However, with enough effort, perhaps these clusters could be interconnected once a sufficient quantity of collaborations are analyzed.
Weak ties that would serve as bridges between disciplines can be tricky to find directly - but, as Travers and Milgram demonstrated, longer chains of strong ties may connect the same clusters. If I were to follow these strong ties far enough, I hoped I would eventually discover an author who collaborated with somebody already in my network. This entire work is predicated upon the idea that I could follow enough strong ties to eventually identify those weak ties that bridge the disciplines.
Now I describe the way I conducted this research.
First, I’ll describe the way that I acquired the data and stored it. Then, I’m going to talk about the scholarship dimension here: how did I perform the search? Then I’ll discuss the way that I analyzed the data once I had acquired it. The analysis consisted of two major steps: building a network of citations and performing the actual network analysis. Finally, I include reporting methods because there’s technique involved in creating network visualizations.
Data Methods: BibTeX
BibTeX is a file format that is widely used for representing and sharing citations. It’s not in any way perfect; there are numerous mis-features that create ambiguity in BibTeX. I’m not aware of a canonical syntax for how to produce BibTeX and I’ve found that every BibTeX system has quirks.
However, in defense of BibTeX, this format is ubiquitous. Almost every serious citation management tool provides some level of functionality for interfacing with BibTeX. In my estimation, the BibTeX format has the highest rate of adoption online.
There are lots of interfaces for translating BibTeX data into different environments, although most of them are somehow incomplete. BibTeX is supported by the scholarship tools I use, including Zotero, a citation manager, LaTeX, a document authoring language, R, a statistical computing environment, and Python, my preferred computation framework. BibTeX is a good compromise for bibliographic work because it is so widely supported.
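As an illustration of that interfacing, author lists can be pulled out of BibTeX with the standard library alone. This is only a sketch: real BibTeX has the quirks mentioned above (nested braces, accents, “and others”), so a proper parser is needed in practice. The example entry is genuine but the regex is my own simplification:

```python
import re

def bibtex_authors(bibtex):
    """Extract author lists from BibTeX entries with a minimal regex.
    Only handles simple, well-formed fields; real BibTeX is messier."""
    fields = re.findall(r'author\s*=\s*[{"]([^}"]+)[}"]', bibtex, re.IGNORECASE)
    return [[name.strip() for name in field.split(" and ")] for field in fields]

entry = """
@article{watts1998collective,
  author = {Watts, Duncan J. and Strogatz, Steven H.},
  title  = {Collective dynamics of 'small-world' networks},
  year   = {1998}
}
"""
print(bibtex_authors(entry))  # [['Watts, Duncan J.', 'Strogatz, Steven H.']]
```

Feeding these author lists into a co-authorship edge builder is the pipeline this talk describes.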
Data Methods: Zotero
Zotero is the platform I use for citation management. Zotero can run locally, on the desktop, and also online. I store my citations both online and offline thanks to Zotero’s design.
There are alternatives, like Endnote, Mendeley, and Papers - but Zotero is open source software, which I think is fundamental to the conduct of reproducible science. In addition, Zotero is extensible with plug-ins and, as I already mentioned, it natively imports and exports BibTeX. In particular, there is a plug-in called BetterBibTeX that streamlines the citation export process. Rapidly exporting citations turned out to be a critical part of my synthesis pipeline so it was important to automate as much as possible.
One final point I’d like to make is that Zotero integrates with web browsers to save citations directly from online catalogs and databases. Since most of my scholarship work occurs through a web browser anyway, Zotero provides great support for my workflow.
I relied on numerous scholarly publications databases, many of which are specific to a discipline or require some domain familiarity. I’m going to provide a quick overview of these resources.
I probably used Google Scholar more than any other search tool but, as a result of this experience, I now have reservations about it. The closed, proprietary nature of the Google Scholar database produced circumstances with conflicting incentives. On the one hand, Google provides free access to their database so it is reasonable to limit access within certain bounds. On the other hand, academic scholarship is accelerated when access and availability are increased. Therefore, Google provides access to Scholar as long as it is not heavily used.
Apparently, my bibliometric research constitutes heavy use. I suspect my profile became associated with a high rate of usage and, as a result, I was lumped into some abuser category. Google Scholar began limiting the number of searches I could perform using Captcha, its anti-bot technology. The way Captcha worked is that I would be presented with an image recognition task involving objects within a picture of a roadway or storefront. Eventually, I became a de facto worker for some unknown Google machine learning project, which is like a scene from a nightmare dystopia.
I used many other search tools, including Worldcat, citeseer, and domain-specific databases like DBLP, APA, and arXiv. One of the benefits of domain tools is the ability to navigate to related resources based on criteria other than keywords. I never encountered rate limits with these tools, but I also didn't use them very heavily. The main drawback of domain-specific tools is that the interface is different on each one and each must be learned. On the other hand, domain-specific search tasks can benefit from customizations that adapt the search tool to the domain.
The University of Toronto library portal was essential for this work. Among other things, the library aggregates licenses to provide access to a vast array of publishers. Due to licensing restrictions, this process is not open in the sense of open source software. However, by academic standards of openness, the library is an open institution that provides access to the community it serves without discrimination. Without getting into a discussion about the academic publishing economy, suffice to say that the library was an invaluable scholarly resource for this research.
Observations of academic publishing over time
As a brief aside, I can qualitatively characterize the literature by decade, according to my search experience.
There is an explosion of publication that occurs in the decade following 2010. In 2018, the current year, virtually everything is computer-readable, everything is well-indexed, and almost all publications are available as digital objects. Some journals and conferences have already moved to digital-only distribution, ceasing hardcopy publication altogether. A consequence of academic literature digitization is that online scholarship methods can be applied to almost everything published this decade.
In the decade following the year 2000, not all articles are computer-readable but almost all are available online. Optical character recognition (OCR) has been used to enable full-text indexing for many articles that were not published in digital format. Articles that have not been OCR-ed may suffer from reduced keyword search and other awkwardness.
The 1990s are when things start getting a little bit more irregular. During this decade, the academic publishing industry was transitioning to digital and most articles were initially distributed as hardcopy. Not as much of the published record from the 1990s is computer readable - and, in fact, not all artifacts are even available online. Starting with the 1990s, it becomes increasingly necessary to physically visit the library in order to track some artifacts down. Nevertheless, almost everything published in a major journal is going to be online in one form or another.
In the pre-digital days, publication is idiosyncratic and irregular. The time period between about 1950 and 1990 is fairly easy to search due to excellent indexing by libraries - a process which was labor-intensive and took decades to perform. However, there’s less standardization across publishers. Articles become hard to track down for different reasons including availability, citation practices, and the likelihood that works are published as books instead of articles. Any time research from this period is published as a book or chapter, this reliably prompts a visit to the library stacks.
Prior to 1950, all bets are off. Once again, libraries have maintained an excellent index of publication for this period. Although some artifacts published between 1920 and 1950 are available online, this availability is not reliable. Relatively fewer journals and conferences even existed prior to 1920 - so many familiar scholarship methods are irrelevant for these decades. When my search brought me to this time period, I frequently had to use non-bibliographic methods to continue locating coauthors.