Building a coauthorship network with R

In this post, I share my experience creating and visualizing coauthorship networks with R. We focus on a particular type of network centered on one scholar (me, in this example). The nodes of the network are my coauthors (people with whom I have published at least one paper), and the weight of the link between two coauthors is proportional to the number of papers they have cosigned (if any). We first scrape data from Google Scholar with the R package scholar to build the network, and then rely on the package networkD3 to visualize it.

At the end of the process we will obtain this network.

 

Scraping data from Google Scholar

We take as input my Google Scholar id and those of my coauthors (the ones who have a Google Scholar account). The id is the value of the user parameter in the URL of a Google Scholar profile page, VSyM8fEAAAAJ in the example below.

https://scholar.google.com/citations?user=VSyM8fEAAAAJ&hl=en

I did not manage to retrieve my full list of coauthors automatically from Google Scholar with the functions of the scholar package, so I did it manually. I gathered all the needed information in a CSV file, Coauthors.csv, that contains five columns:

  • ID: Unique integer node id. The first node is the central node (me)
  • Name: Full name of the author as it will appear on the final network
  • Scholar: Google Scholar id
  • Group: Group of the coauthor. You can define different groups of coauthors, displayed in different colors
  • W: Weight of the node, used to set the size of the circles. We will update this value according to the number of publications in common with me.

We import the file into the data frame co.

co=read.csv("Coauthors.csv", stringsAsFactors=FALSE) 
nco=dim(co)[1] 
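As an illustration, here is what a small file with this layout could look like. The names, groups and Scholar ids below are placeholders rather than real data, and the ids start at 0 only because networkD3 expects zero-based node indices; the numbering of the actual file is not shown in this post.

```r
# Hypothetical content of Coauthors.csv (placeholder names and Scholar ids)
csv_text = "ID,Name,Scholar,Group,W
0,Central Author,AAAAAAAAAAAA,1,0
1,Coauthor One,BBBBBBBBBBBB,2,0
2,Coauthor Two,CCCCCCCCCCCC,2,0
3,Coauthor Three,DDDDDDDDDDDD,3,0"

# read.csv accepts the file content through its text argument
co_example = read.csv(text=csv_text, stringsAsFactors=FALSE)
nco_example = nrow(co_example)   # number of nodes, central author included

str(co_example)
```

The W column is left at 0 here since it is overwritten later with the number of publications in common.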

Google Scholar data can be messy, particularly if a scholar's profile is not regularly cleaned by hand. Since we rely on the number of publications in common between every pair of my coauthors to build the network, the comparison of titles needs to be insensitive to case and tolerant of small differences in spacing and punctuation. More specifically, if two article titles are very similar, like “Human Mobility: Models and Applications” and “Human mobility : models and applications”, we want to count them as a single publication. For this purpose, I wrote the function simat, which returns a matrix of similarities between two vectors of character strings li and lj. The element ij of the matrix is the number of characters shared by the ith string of li and the jth string of lj (computed with vintersect from the package vecsets, which keeps repeated characters), divided by the mean length of the two strings.

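library(vecsets)   # provides vintersect (intersection that keeps duplicated elements)

simat=function(li, lj){ 

    res=matrix(0,length(li),length(lj))
    for(i in seq_along(li)){   # seq_along also handles empty vectors
        split1=unlist(strsplit(tolower(li[i]), ""))   # characters of the ith string of li
        for(j in seq_along(lj)){
            split2=unlist(strsplit(tolower(lj[j]), ""))   # characters of the jth string of lj
            # Shared characters (with multiplicity) divided by the mean length of the two strings
            res[i,j]=2*length(vintersect(split1,split2))/(nchar(tolower(li[i]))+nchar(tolower(lj[j]))) 
        } 
    }    

    return(res)

}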

We can use this function to build two further functions: duplipubli, which removes duplicates from a vector of article titles, and intersectpubli, which computes the number of publications in common between two authors from their vectors of article titles. Both take a threshold above which two strings are considered to be the same article. After a few tests, I found that a threshold of 0.95 gives satisfying results. For example, the comparison between “Human Mobility: Models and Applications” and “Human mobility : models and applications” returns a score of 0.987.
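To see where the 0.987 comes from, here is a minimal base-R sketch of the score for a single pair of strings. The helper char_similarity is a hypothetical name introduced for this illustration; it counts shared characters with table and pmin instead of vintersect, so it can be run without the vecsets package.

```r
# char_similarity is a hypothetical base-R helper that mirrors one cell of simat:
# shared characters (counted with multiplicity) over the mean string length
char_similarity = function(a, b){
    ca = table(strsplit(tolower(a), "")[[1]])   # character counts of a
    cb = table(strsplit(tolower(b), "")[[1]])   # character counts of b
    shared = intersect(names(ca), names(cb))
    common = sum(pmin(as.vector(ca[shared]), as.vector(cb[shared])))
    2*common/(nchar(a)+nchar(b))
}

score = char_similarity("Human Mobility: Models and Applications",
                        "Human mobility : models and applications")
print(round(score, 3))   # 78/79, i.e. 0.987

# Character order is ignored, so anagrams cannot be told apart
anagram_score = char_similarity("listen", "silent")   # equals 1
```

The two titles have 39 and 40 characters and share 39 of them once case is ignored, hence 2 x 39 / 79. Because the score only counts characters, two different titles made of the same letters would also reach 1, which is one reason to keep the threshold high.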

duplipubli computes the similarity matrix and iteratively removes the duplicates (strings whose similarity with an earlier string is higher than the threshold).

duplipubli=function(li, threshold){ 

    if(length(li)<2){
        return(li)   # nothing to compare
    }
    res=simat(li, li) 

    i=1
    while(i<length(li)){   # length(li) is re-evaluated after each removal
        # Strings located after the ith one that are too similar to it
        indupl=which(res[i,]>threshold)
        indupl=indupl[indupl>i]
        if(length(indupl)>0){
            li=li[-indupl] 
            res=res[-indupl,-indupl,drop=FALSE]   # drop=FALSE keeps res a matrix
        }
        i=i+1 
    }

    return(li)

}

intersectpubli computes the similarity matrix and the number of strings in common between two string vectors.

intersectpubli=function(li, lj, threshold){ 

    res=simat(li, lj)
  
    return(sum(res>threshold))

}
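The two steps can be tried on toy data. The sketch below is self-contained and does not call the functions above: sim_pair and sim_matrix are hypothetical base-R stand-ins for simat, and the title lists are invented for the illustration (apart from the title already used as an example in this post).

```r
# Hypothetical base-R stand-ins for simat (no vecsets needed)
sim_pair = function(a, b){
    ca = table(strsplit(tolower(a), "")[[1]])
    cb = table(strsplit(tolower(b), "")[[1]])
    shared = intersect(names(ca), names(cb))
    2*sum(pmin(as.vector(ca[shared]), as.vector(cb[shared])))/(nchar(a)+nchar(b))
}
sim_matrix = function(li, lj){
    res = matrix(0, length(li), length(lj))
    for(i in seq_along(li)) for(j in seq_along(lj)) res[i,j] = sim_pair(li[i], lj[j])
    res
}

# Invented publication lists for two authors
titles_a = c("Human Mobility: Models and Applications",
             "Human mobility : models and applications",
             "An example title about commuting flows")
titles_b = c("Human mobility: models and applications",
             "Another example title about trip distribution models")

# Step 1: greedy removal of duplicates, keeping the first occurrence of each title
keep = c()
for(k in seq_along(titles_a)){
    if(length(keep)==0 || all(sim_matrix(titles_a[k], titles_a[keep]) <= 0.95)){
        keep = c(keep, k)
    }
}
unique_a = titles_a[keep]

# Step 2: number of pairs of titles above the threshold
common = sum(sim_matrix(unique_a, titles_b) > 0.95)
print(unique_a)
print(common)
```

Removing duplicates first matters: without it, both spellings of the first title in titles_a would match the title in titles_b and the paper would be counted twice.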

We can now use these functions and the function get_publications (package scholar) to build the network by computing, for each pair of scholars, their number of articles in common. get_publications takes a Google Scholar id as input. Note that I filter out entries without a year of publication.

The network is stored in net, with one row per pair of scholars: the ids of the two scholars and their number of articles in common.

library(scholar)

net=NULL
for(i in 1:(nco-1)){   # up to nco-1 so that the pair formed by the last two coauthors is included

    # Extract list of articles of scholar i
    idi=co[i,3]
    li=get_publications(idi, cstart = 0, pagesize = 100, flush = FALSE)
    li=li[!is.na(li$year),]   # Remove articles without publication year
    li=as.character(li$title)
    li=duplipubli(li, threshold=0.95)  # Remove duplicates
    Sys.sleep(1)

    for(j in (i+1):nco){

        # Extract list of articles of scholar j
        idj=co[j,3]
        lj=get_publications(idj, cstart = 0, pagesize = 100, flush = FALSE)
        lj=lj[!is.na(lj$year),]   # Remove articles without publication year
        lj=as.character(lj$title)   
        lj=duplipubli(lj, threshold=0.95)  # Remove duplicates
        Sys.sleep(1)

        # Add the number of articles in common between scholar i and j to the network
        net=rbind(net, c(co[i,1], co[j,1], intersectpubli(li,lj, threshold=0.95)))
        
    }

}

We now set the node weights W according to the number of publications in common with me. Of course, the number of publications that I have in common with myself is my total number of publications. Since we do not have this information yet, I use the function get_num_articles (package scholar) to retrieve it. For my coauthors, this information is available in the first nco - 1 rows of net, which hold the links between me and each of my coauthors.

co[1,5]=get_num_articles(co[1,3])   
co[2:(dim(co)[1]),5]=net[1:(dim(co)[1]-1),3] 
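The reason the first rows of net can be copied directly into W is the order in which the nested loops visit the pairs. A small sketch with a hypothetical network of four scholars (nco_toy, pairs and my_rows are illustrative names, and the pairs are identified by row number in co rather than by id):

```r
nco_toy = 4   # hypothetical network: me (row 1) and three coauthors

# Reproduce the order in which the nested loops visit the pairs (i, j)
pairs = NULL
for(i in 1:(nco_toy-1)){
    for(j in (i+1):nco_toy){
        pairs = rbind(pairs, c(i, j))
    }
}
colnames(pairs) = c("row_i", "row_j")
print(pairs)

# The first nco_toy - 1 rows all involve row 1, the central author,
# and list the coauthors in the same order as the rows 2 to nco_toy of co
my_rows = pairs[1:(nco_toy-1), ]
```

This only works as long as the central author is the first row of co and net has not been filtered or reordered yet, which is why the weights are set before the formatting step.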

Design and create the network

The function forceNetwork (package networkD3) creates a D3 JavaScript network graph from a set of nodes and links and their attributes (name, group and size for the nodes, value for the links). It expects the source and target of each link to be zero-based indices into the table of nodes. We therefore need to format the two tables co and net by keeping only the name, group and size of the nodes,

co=co[,c(2,4,5)]
colnames(co)=c("name","group","size")

and the links with at least one publication in common.

net=as.data.frame(net[net[,3]>0,,drop=FALSE])   # forceNetwork expects a data frame
colnames(net)=c("source","target","value")

Since the links are drawn in the order in which they appear in net, we reverse the rows so that my links are drawn last and end up on top.

net=net[dim(net)[1]:1,] 

We define two link colors: grey for the links between my coauthors and blue for the links between my coauthors and me. After the reversal, my links are the last nco - 1 rows of net.

colo=rep("lightgrey",dim(net)[1])
colo[ (dim(net)[1] - (dim(co)[1]-2)):dim(net)[1] ]="#1F77B4"
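The positional indexing above relies on every one of my links surviving the filter and sitting at the end of net. As a sketch of an alternative, the colors can be derived from the endpoints of each link instead. This assumes that the central node has id 0 (adapt the value to the id used in Coauthors.csv); net_toy, central and colo_toy are illustrative names and the values are made up.

```r
# Toy link table after filtering and reversal (hypothetical values);
# node 0 is assumed to be the central author
net_toy = data.frame(source=c(2,1,0,0,0),
                     target=c(3,2,3,2,1),
                     value=c(1,2,4,3,5))

# Blue for links that touch the central node, grey for the others
central = 0
colo_toy = ifelse(net_toy$source==central | net_toy$target==central,
                  "#1F77B4", "lightgrey")
print(colo_toy)
```

On this toy table the result is the same as with the positional rule (the last three links are blue), but it would stay correct if a link to the central node were missing or if the rows were ordered differently.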

We then build the network with forceNetwork. Many aspects of the network, such as the distance between nodes, can be adjusted with the parameters of forceNetwork.

library(networkD3)

G=forceNetwork(Links=net, Nodes=co, Source = "source", Target = "target", NodeID = "name", Group = "group",
        
        # Custom nodes and labels  
        Nodesize="size",                                                    # name of the column that gives the size of nodes  
        radiusCalculation = JS("d.nodesize/2+5"),                           # how to turn this column into a node radius (JavaScript expression)  
        opacity = 1,                                                        # opacity of the graph elements  
        opacityNoHover = 0,                                                 # opacity of the node labels when not hovered (0 hides them)  
        colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),          # JavaScript expression, schemeCategory10 and schemeCategory20 work  
        fontSize = 20,                                                      # Font size of labels  
        fontFamily = "sans serif",                                          # Font family for labels  
 
        # Custom edges  
        Value="value",
        arrows = FALSE,                                                     # Add arrows?  
        linkColour = colo,                                                  # colour of edges  
        linkWidth = JS("function(d) { return Math.sqrt(d.value); }"),       # edges width  
 
        # Layout  
        linkDistance = 100,                                                 # link length, if higher, more space between nodes  
        charge = -30,                                                       # if highly negative, more space between nodes  
                
        # General parameters  
        height = NULL,                                                      # height of frame area in pixels  
        width = NULL,
        zoom = FALSE,                                                       # Can you zoom on the figure  
        legend = FALSE,                                                     # add a legend?  
        bounded = TRUE, 
        clickAction = NULL
        
)

Export the network

You can finally export the network to an HTML file with the following piece of code.

htmlwidgets::saveWidget(G,"Coauthorship.html")

I inserted it on my website made with Jekyll using the code below.

<p style="margin-left:-150px;margin-bottom: 0px;">
    <iframe width="115%" height="450px" src="/assets/Coauthorship.html" frameborder="0"></iframe>
</p> 

The scripts are available on my website.