This document introduces you to social network analysis (SNA), including different network measures and network visualizations. You will work with a cleaned dataset and then create your own network to put your knowledge into practice. This tutorial was created by Cristina Chueca Del Cerro for the Introduction to Network Analysis lecture part of the MSc course Practicing Research and Working with Data in the Digital Age 2021-2022 (SOCIO5102) at the University of Glasgow in October 2021.
Setting up our R Project
Before we get started, you should create an R Project in your working directory, the folder where you keep the data. You can either click File > New Project… or click on the second icon from the top left (a blue square with a + sign and an R). For further details on how to create your R project and select your working directory, see here. Make sure that the data we’ll be using for this tutorial is in the project folder you have selected, otherwise R will get angry at you because it cannot find the data you’re looking for. The data file can be found under week 6 of the course’s Moodle page here.
The next step is to open a blank script: click the first icon in the top-left bar or press File > New script. In this new script you will need to install and load the necessary libraries. These libraries allow R to read, manipulate, analyse and visualize your data. The install commands below are commented out, so you will need to remove the ‘#’ in order to run them in R. Hashtags allow you to write notes alongside your code, to make sense of it and to help others work out your logic.
# Installing required libraries - only run it once as the library will be forever installed in the R version you have
#install.packages("igraph")
#install.packages("ggnetwork")
#install.packages("visNetwork")
#install.packages("ggplot2")
# Following installation you need to request those libraries - using either: library(name) or require(name)
library(ggplot2)
require(igraph)
require(ggnetwork)
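If you are not sure which of these libraries you already have, a minimal sketch like the one below can tell you before you install anything. The names `pkgs`, `is_installed` and `missing_pkgs` are only illustrative, and the install line is left commented out on purpose.

```r
# Packages used in this tutorial
pkgs <- c("igraph", "ggnetwork", "visNetwork", "ggplot2")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # remove the '#' to install them
} else {
  message("All packages are already installed.")
}
```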
Loading the data and summary statistics
Once the libraries are loaded it’s time to import our dataset. We’ll be working with .csv files, one of the most common data file types. Note that we have two separate files reflecting the different properties of our social network.
On the one hand, data_edges represents the relationships between people in the network. It has four columns: end1 and end2 represent who is friends with whom; weight is the strength of the relationship (how close those two people are); and relationship records where each pair of people met (online or offline).
On the other hand, data_attribs has the individual attributes of each of the people in our network. Besides the person’s identifier (Person), these attributes are the person’s national identity attitude (nat_id), sex, and age.
# Loading the dataset for this tutorial
data_attribs <- read.csv("attribs_demo.csv", header = TRUE, stringsAsFactors = TRUE)
data_edges <- read.csv("edges_demo.csv", header = TRUE, stringsAsFactors = FALSE)
#View(data_attribs) allows us to see the data on a separate tab
#View(data_edges)
Every time we import data into R we need to ensure it’s in the right format for us to be able to manipulate it. The first thing we do is check whether the data is in the intended format, and if not we can change it. We use str() to accomplish this. We can see that we have imported two data frames with four variables each, but the relationship variable from data_edges has been read as character when it should be a factor. We can change that by using as.factor(), and if we run str() again we can see that it has changed correctly.
#Let's check data_attribs
str(data_attribs)
## 'data.frame': 100 obs. of 4 variables:
## $ Person: int 3 27 22 77 48 6 41 38 98 94 ...
## $ nat_id: num -0.1257 -0.2072 0.4831 -0.5212 0.0998 ...
## $ sex : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 2 2 2 2 ...
## $ age : int 61 74 22 71 39 59 42 19 20 41 ...
# Let's check data_edges
str(data_edges)
## 'data.frame': 872 obs. of 4 variables:
## $ end1 : int 100 100 100 100 100 100 100 100 100 100 ...
## $ end2 : int 1 2 3 4 5 10 11 12 14 17 ...
## $ weight : num 0.336 0.949 0.313 0.193 0.269 ...
## $ relationship: chr "online_friendships" "online_friendships" "online_friendships" "online_friendships" ...
data_edges$relationship <- as.factor(data_edges$relationship)
Note this issue only happened to data_edges and not to data_attribs. But why? Because when we imported data_attribs we set stringsAsFactors to true, meaning that any character/string column found in the dataset would automatically be converted to a factor, whereas for data_edges we set it to false. You should always check how your data was converted when importing it, even if you use stringsAsFactors, as R can misread data.
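The effect of this argument can be seen on a tiny example. The sketch below reads a made-up three-row edge list from a string rather than from a file (the values and the names `csv_text`, `as_chr` and `as_fct` are hypothetical), once with each setting.

```r
# A hypothetical three-row edge list, read from text instead of a file
csv_text <- "end1,end2,relationship
1,2,online
1,3,offline
2,3,online"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$relationship)   # "character"
class(as_fct$relationship)   # "factor"
levels(as_fct$relationship)  # "offline" "online"

# Converting afterwards, as done for data_edges in this tutorial
as_chr$relationship <- as.factor(as_chr$relationship)
class(as_chr$relationship)   # "factor"
```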
The next thing is to run some summary statistics on our datasets to get an indication of what’s going on inside our data. We can do this with the function summary(). In doing so, we could uncover potential outliers, but in this case there aren’t any because it’s a cleaned dataset.
summary(data_attribs)
## Person nat_id sex age
## Min. : 1.00 Min. :-1.00000 F:48 Min. :18.00
## 1st Qu.: 25.75 1st Qu.:-0.40443 M:52 1st Qu.:34.75
## Median : 50.50 Median :-0.11491 Median :54.00
## Mean : 50.50 Mean :-0.04817 Mean :49.89
## 3rd Qu.: 75.25 3rd Qu.: 0.27971 3rd Qu.:65.25
## Max. :100.00 Max. : 1.00000 Max. :79.00
summary(data_edges)
## end1 end2 weight relationship
## Min. : 1.00 Min. : 1.00 Min. :0.0000905 offline_friendships:357
## 1st Qu.: 12.00 1st Qu.:46.00 1st Qu.:0.2401220 online_friendships :515
## Median : 28.00 Median :65.00 Median :0.4848960
## Mean : 34.21 Mean :61.61 Mean :0.4888084
## 3rd Qu.: 50.25 3rd Qu.:80.00 3rd Qu.:0.7199639
## Max. :100.00 Max. :99.00 Max. :0.9992411
Network Measures
We need to create a network object from our dataset. We can do so by using graph_from_data_frame and specifying whether our network is directed or undirected. This function treats the first two columns (end1 and end2) as the ends of each edge and stores the remaining columns (weight and relationship) as edge attributes. In this case, we have an undirected network since friendships are bi-directional. We can check the number of edges (or links) and nodes (or vertices) as well as the shortest paths and diameter of the network.
We can also check the degree of a node, the number of edges a given node has. This will become important for our network measures. The average degree is a helpful indicator of how dense our network is, since it gives the number of edges per node on average.
#Creating our network object
g <- graph_from_data_frame(data_edges, directed = F)
# Vertex and edge counts
vcount(g)
## [1] 100
ecount(g)
## [1] 872
# Degree: number of edges a given node has & average degree
V(g)$degree <- degree(g)
mean(degree(g))
## [1] 17.44
We can see that there are 100 nodes (vertices) and 872 edges (links). On average, each of these nodes has 17.44 edges, which means over 17 friends. With this information we can calculate the network density: the ratio of the number of edges that exist to the number of edges that could possibly exist.
Since we have an undirected network, with N denoting the number of nodes, the number of possible edges is \(\frac{N (N - 1)}{2}\). E corresponds to the number of observed edges in the network. From there, density D can be expressed as \[D = \frac{2E}{N (N - 1)} \] or we can ask R to work it out for us.
##### NETWORK MEASURES ####
# Network Density - how many edges exist out of the potential edges in a network
edge_density(g)
## [1] 0.1761616
The edge density for our network is 0.18, meaning that only 18% of the potential ties between our nodes are actually present. The main takeaway from this is that our network is fairly sparse: most pairs of people are not directly connected to each other.
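The density reported above can be reproduced by hand from the vertex and edge counts shown earlier. This is only a quick arithmetic check in base R; the variable names are illustrative.

```r
n_nodes <- 100   # value returned by vcount(g)
n_edges <- 872   # value returned by ecount(g)

# Number of possible edges in an undirected network without self-loops
possible_edges <- n_nodes * (n_nodes - 1) / 2
possible_edges             # 4950

# Density: observed edges divided by possible edges
density <- n_edges / possible_edges
round(density, 7)          # 0.1761616

# Average degree: every edge contributes to the degree of two nodes
mean_degree <- 2 * n_edges / n_nodes
mean_degree                # 17.44
```

Note that density is also the average degree divided by N - 1, which is why a higher average degree in a network of the same size means a denser network.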
To explore this further we move on to distance and centrality measures. We start with the average path length, or mean distance in R terms. This is one of the most robust network measures since the shorter this path is, on average, the more interconnected and efficient a network is.
Let \(d(v_i, v_j)\) be the distance between two nodes, that is, the length of the shortest path between both vertices (nodes). We take the sum of the shortest paths between all pairs of vertices and divide that by the number of possible pairs, where n is the number of nodes. This results in \[l_G = \frac{1}{n (n-1)} \sum_{i \neq j} d(v_i, v_j) \] or once more we can ask R to calculate it for us and save us time.
##### NETWORK CENTRALITY MEASURES ####
# Average path length - mean length of the shortest paths between all pairs of nodes
# distances(g)  # matrix with the shortest path length between every pair of nodes
mean_distance(g)  # average path length over all pairs of nodes
## [1] 2.208485
The average path length for our network is 2.21, meaning that the shortest path between two vertices contains 2.21 edges on average, or that it takes just over two steps to get from vertex \(v_i\) to vertex \(v_j\).
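To see how the formula for the average path length works, here is a minimal sketch on a made-up chain of four people (A, B, C, D), not the tutorial data. It applies the formula to the matrix returned by distances() and compares the result with mean_distance().

```r
library(igraph)

# Hypothetical chain A - B - C - D with no edge weights
chain <- graph_from_edgelist(rbind(c("A", "B"),
                                   c("B", "C"),
                                   c("C", "D")), directed = FALSE)

d <- distances(chain)   # shortest path length between every pair of nodes
n <- vcount(chain)

# Sum over all ordered pairs (the diagonal is 0) divided by n * (n - 1)
manual <- sum(d) / (n * (n - 1))
manual                  # 1.666667

mean_distance(chain)    # same value, computed by igraph
```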
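The diameter, in turn, tells us the distance between the two nodes that are furthest apart from each other. In other words, it is the longest of all the shortest paths between pairs of nodes. Because our graph has a weight edge attribute, diameter() adds up the edge weights along the path, so the value below is a weighted distance rather than a number of steps.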
# Network diameter - same principle but for the pair of nodes that are furthest apart
diameter(g, directed = FALSE)  # edge weights are used by default; add weights = NA to count steps instead
## [1] 0.9607175
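The role of edge weights in diameter() can be seen on a toy graph. The sketch below uses a made-up weighted chain (the weights are arbitrary, not taken from the tutorial data) and compares the default call with weights = NA.

```r
library(igraph)

# Hypothetical chain A - B - C - D with arbitrary edge weights
chain <- graph_from_edgelist(rbind(c("A", "B"),
                                   c("B", "C"),
                                   c("C", "D")), directed = FALSE)
E(chain)$weight <- c(0.5, 0.25, 0.25)

# Default: the 'weight' attribute is used, so weights are summed along the path
diameter(chain)                # 1

# weights = NA ignores the weights and counts the number of steps
diameter(chain, weights = NA)  # 3
```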
Continuing with measures of centrality, we find closeness centrality and betweenness centrality. The first measure, closeness, tells us how close a node is to the rest of the nodes in the network. The higher a node’s closeness, the shorter its distances to everyone else in the network.
Betweenness centrality, in contrast, tells us the role of the node in relation to the rest of the nodes in the network. In other words, whether that node is central and serves as a bridge to other nodes. Betweenness is important since central nodes act as bridges connecting otherwise disconnected nodes in the network, which means they control what information passes through the network.
# Closeness: distance to others in the graph
V(g)$closeness <- closeness(g)
max(V(g)$closeness)
## [1] 0.01929266
min(V(g)$closeness)
## [1] 0.009599216
# Betweenness: centrality based on a broker position connecting others
V(g)$betweenness <- betweenness(g)
max(V(g)$betweenness)
## [1] 1758
min(V(g)$betweenness)
## [1] 0
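Both measures are easier to read on a network small enough to check by hand. The sketch below uses a made-up star network (one person in the middle connected to four others, no edge weights), which is an assumption for illustration and not part of the tutorial data.

```r
library(igraph)

# Hypothetical star: node 1 is the centre, nodes 2 to 5 are only tied to it
star <- make_star(5, mode = "undirected")

# Betweenness: every shortest path between two outer nodes passes through
# the centre, and there are choose(4, 2) = 6 such pairs
betweenness(star)   # 6 0 0 0 0

# Closeness: 1 divided by the sum of distances to all other nodes
# centre: 1 / (1 + 1 + 1 + 1) = 0.25
# outer node: 1 / (1 + 2 + 2 + 2) = 1/7, about 0.143
closeness(star)
```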
Visualising our network
The next step is to visualize this network. Building plots can be challenging since there are many details to look out for, from labels to size and color. Below you can find various examples of building a plot from scratch and see the difference a few commands can make to our visualisations.
#Improving our visualisation
plot(g, layout = layout_with_fr, vertex.label = "", edge.arrow.size=.5,
vertex.color="orange", vertex.size=6,
vertex.frame.color="white", edge.curved=0.2)
#Trying out different layouts - substitute layout for layout1, layout2, layout3 and see what happens!
layout <- layout_on_grid(g, width = 20)
layout1 <- layout_in_circle(g)
layout2 <- layout_with_mds(g)
layout3 <- layout_components(g)
#for more layout info - check out the layout documentation
plot(g, layout = layout, vertex.label = "", vertex.color="orange", vertex.size=7,
vertex.frame.color="white", edge.curved=0.3)
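A layout is simply a matrix with one row of x and y coordinates per node. Some layout algorithms, such as layout_with_fr, start from random positions, so the picture changes between runs. A minimal sketch, assuming a made-up ring of six nodes, of storing the coordinates once and reusing them (the seed value is arbitrary):

```r
library(igraph)

# Hypothetical ring of six nodes
ring <- make_ring(6)

set.seed(2021)                  # fixing the seed should make the layout repeatable
coords <- layout_with_fr(ring)  # matrix of coordinates, one row per node

dim(coords)                     # 6 rows (nodes) and 2 columns (x and y)

# Reusing the stored coordinates keeps the nodes in the same place across plots
plot(ring, layout = coords, vertex.label = "", vertex.color = "orange")
plot(ring, layout = coords, vertex.label = "", vertex.color = "skyblue")
```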
Plotting with ggplot2 and ggnetwork
ggnetwork takes visualising networks in R to a new level since it uses ggplot2 as its base. This first plot is a simple visualization of our network with the nodes in light blue and the edges in grey. You can see that the size of the nodes changes depending on the node degree. In doing so, nodes that have a lot of edges (high degree) appear larger than nodes that have fewer edges (low degree).
ggplot(g, layout = with_dh(), aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "grey50", alpha = 0.5) +
geom_nodes(aes(size = degree), alpha = 0.8, color = "skyblue") +
theme_blank()
We can upgrade our plot by highlighting nodes based on their degree. The nodes colored in reddish/rose are labelled as brokers, meaning they serve as bridges between different groups, connecting otherwise disconnected nodes. They have the power to share information between communities that would otherwise be disconnected.
ggplot(g, layout = with_fr(), aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "grey50", alpha = 0.5) +
geom_nodes(aes(size = degree, color = ifelse(degree < 20, "broker", "other")), alpha = 0.8) +
# geom_nodelabel_repel(aes(label = ifelse(degree < 20 , name, NA)), alpha = 0.8) +
theme_blank()
Alternatively, we can highlight nodes based on betweenness centrality and use a different layout algorithm to visualise the network. Also note how the label for our color legend has been changed from the previous plot, making it nicer and clearer what is being represented. In this case we’re focusing on betweenness centrality to see who acts as a bridge between network communities. These nodes are colored red, whereas nodes that have a low betweenness centrality, meaning few shortest paths between other nodes pass through them, are blue.
ggplot(g, layout = nicely(), aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "grey50", alpha = 0.5) +
geom_nodes(aes(color = ifelse(betweenness > 100, "broker", "other")), alpha = 0.8, size = 2) + # fixed size = 2, so no size mapping is needed
scale_color_manual(name = "Betweenness centrality", values = c("red", "blue"), labels = c("Bridge (BC > 100)", "Other (BC <= 100)")) +
theme_blank()
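Writing the ifelse() inside aes() works, but the colors and legend labels then depend on the alphabetical order of the two groups. A minimal sketch of an alternative, assuming a made-up star network instead of the tutorial data: the group is stored as a vertex attribute first (the name `role` and the cut-off of 0 are illustrative choices) and the colors are matched to the groups by name.

```r
library(igraph)
library(ggplot2)
library(ggnetwork)

# Hypothetical star network: only the centre lies on paths between other nodes
star <- make_star(6, mode = "undirected")

# Store betweenness and a broker/other label as vertex attributes
V(star)$betweenness <- betweenness(star)
V(star)$role <- ifelse(V(star)$betweenness > 0, "broker", "other")
table(V(star)$role)   # 1 broker, 5 other

# Named values tie each color to its group regardless of ordering
p <- ggplot(star, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "grey50", alpha = 0.5) +
  geom_nodes(aes(color = role), size = 4) +
  scale_color_manual(name = "Betweenness centrality",
                     values = c(broker = "red", other = "blue")) +
  theme_blank()
p
```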
Making your own network
Sometimes you may want to create your own network from observational or survey data. In this case I created an undirected network of random pairs among seven people. Then, I used the ggplot commands to visualise the network and colored the nodes depending on their betweenness centrality. We can see that Anne is a bridge connecting people who would otherwise be disconnected. Visualising networks helps us make sense of what the data is telling us alongside the network measures discussed previously.
# We're going to create an undirected network of random pairs
demo <- rbind(c("Anne","Bruce"),
c("Anne","Charlie"),
c("Bruce","Charlie"),
c("Danielle","Edward"),
c("Danielle","Francis"),
c("Edward","Francis"),
c("Anne","Danielle"),
c("Charlie", "Luke"),
c("Francis", "Danielle"))
demo_graph <- graph_from_edgelist(demo, directed = FALSE)
plot(demo_graph)
V(demo_graph)$betweenness <- betweenness(demo_graph) # betweenness of the demo network, not of g
ggplot(demo_graph, aes(x, y, xend = xend, yend = yend)) +
geom_edges(colour = "grey50") +
geom_nodes(aes(colour = betweenness), size = 3) +
scale_colour_gradient(low = "gold", high = "tomato") +
theme_blank() +
geom_nodetext(aes(label = name))
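The plot can be backed up with the numbers themselves. The sketch below rebuilds the same pairs so that it runs on its own and ranks people by betweenness. One assumption is added here: simplify() is used to merge the Danielle and Francis tie, which appears twice in the list, into a single edge.

```r
library(igraph)

# Same pairs as in the demo network above
pairs <- rbind(c("Anne", "Bruce"),
               c("Anne", "Charlie"),
               c("Bruce", "Charlie"),
               c("Danielle", "Edward"),
               c("Danielle", "Francis"),
               c("Edward", "Francis"),
               c("Anne", "Danielle"),
               c("Charlie", "Luke"),
               c("Francis", "Danielle"))

# Build the undirected network and merge the repeated tie
demo_simple <- simplify(graph_from_edgelist(pairs, directed = FALSE))

# Betweenness per person, highest first
bc <- betweenness(demo_simple)
sort(bc, decreasing = TRUE)
# Anne (9) comes first, followed by Danielle (8) and Charlie (5);
# everyone else has a betweenness of 0
```

Anne and Danielle both sit on the only route between the two friendship triangles, which is why their scores are so close.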
Final remarks
I hope you find this tutorial useful and that you end up using SNA for your research. Please do not hesitate to contact me should you have any questions or issues (c.chueca-del-cerro.1@research.gla.ac.uk).