Over the past few weeks, I have been trying to learn the basics of statistical analysis of graphs, more specifically, complex networks. Here I must be honest, with high school level algebra as my only mathematical tool, trying to work through an article such as Albert and Barabási’s “Statistical mechanics of complex networks” is a daunting task. However, I have been able to get through the basic concepts and begin applying them to my work.

Something I found particularly interesting is the concept of Degree Distribution. As we already know, the degree of a node refers to the number of edges that are connected to that node, and not all of the nodes in a graph have the same degree. The distribution of degree in a network is “characterized by a distribution function P(k), which gives the probability that a randomly selected node has exactly k edges” (Albert, Barabási 2002). The degree distribution of a random graph, such as the variants of the Erdős–Rényi model, is a Poisson distribution, because “in a random graph the edges are placed randomly, the majority of nodes have approximately the same degree, close to the average degree <k> of the network” (Albert, Barabási 2002). However, is has been demonstrated that many real world graph’s degree distribution differ greatly from the Poisson distribution, instead, they demonstrate a power-law tail, *P*(*k*) ~ *k*^{−γ} ,and are called scale free networks^{ }(Albert, Barabási 2002).

Thinking about this, I became curious about the Preliminaries graph and its degree distribution. Preliminaries is a real-world network, its based on real information gathered from physical objects; however, the process of designing the schemas and relationships requires a fair amount of human manipulation. More about Preliminaries here, here, and here. So I decided to find the Prelims degree distribution and compare it with some random graphs. Now, as I said, my math skills are lacking a bit, something I plan on really working on over the next few years, but I do have what is takes to model degree distribution using the Python modules NetworkX and Pylab. Here I will briefly describe my methodology(code) and the resulting plots.

First I decided to generate two random graphs and plot their degree distribution:

GNP Graph (Erdős–Rényi model):

This graph is generated by inputing the number of nodes in the graph and the probability that there is an edge between each pair of nodes. Because I wanted to imitate the node and edge count of the Prelims graph, I found a probability that would generate approximately 3464 edges. The formula for computing the probability is included in the code snippet used to generate the graph:

import pylab as pl from networkx import * ### generate gnp_random_graph ### n = number of nodes ### m = expected number of edges n = 1616 m = 3464 ### p = probablity of edge creation ### m = p*n(n-1) ### 3464 = p*1616(1615) ### p = 0.0013272844312295006 p = 0.0013272844312295006 # generate graph G = gnp_random_graph(n,p,directed=True) # print basic stats print ("Number of Nodes : %i" % (n)) print ("Number of Edges : %i" % (number_of_edges(G))) # make a list of each node's degree degree_list = list(G.degree().values()) # compute and print average node degree print ("Avg. Node Degree: %f" % (float(sum(degree_list))/n)) # generate a list degree distribution degree_hist = degree_histogram(G) if len(degree_hist) < 15: print ("Degree Fequency List:") print ("Degree : # of Nodes") # print the degree and number of nodes that have that degree for degree,number_of_nodes in enumerate(degree_hist): print ("%i : %i" % (degree,number_of_nodes)) else: print ("Degree Frequency List Too Long to Print") # generate x,y values for degree dist. scatterplot x_list = [] y_list = [] for degree,num_of_nodes in enumerate(degree_hist): if num_of_nodes > 0: x_list.append(degree) y_list.append(num_of_nodes) # label the graph pl.title('Degree Distribution\nGNP Graph') pl.xlabel('Degree') pl.ylabel('Frequency') # plot degree distribution pl.scatter(x_list,y_list) pl.show()

This script results in the terminal output:

Number of Nodes : 1616 Number of Edges : 3605 Avg. Node Degree: 4.461634 Degree Fequency List: Degree : # of Nodes 0 : 17 1 : 76 2 : 201 3 : 284 4 : 299 5 : 259 6 : 218 7 : 119 8 : 82 9 : 32 10 : 19 11 : 7 12 : 2 13 : 1

And the following scatter plot:

As you can see this resembles a Poisson distribution, like this one taken from the WolframAlpha website:

Next I generated a random scale free graph. The script I used was very similar to the previous script, except I used a different graph generator with only the node count as a parameter and I set the tighter limits for the x and y axes:

n = 1616 G = scale_free_graph(n) # set limits for the axes pl.gca().set_xlim([-10,70]) pl.gca().set_ylim([-10,120])

This script results in the following terminal output:

Number of Nodes : 1616 Number of Edges : 3428 Avg. Node Degree: 4.242574 Degree Frequency Too Long to Print

And the scatter plot:

This plot resembles a power law tail, such as this one from WolframAlpha:

Scale free distributions are commonly plotted in using log-log plots, such as those used by Albert and Barabási in the previously mentioned article. To produce a log-log plot, you can simply change the Pylab scale to log, also for better visualization change the axes limits:

# set limits for the axes pl.gca().set_ylim([0.9,1000]) pl.gca().set_xlim([0.9,1000]) # log-log plot pl.gca().set_xscale("log") pl.gca().set_yscale("log")

The random scale free graph plotted in log-log looks like this:

So, now that we have seen what a PNG random graph and a scale free random graph degree distribution looks like, let’s take a look at the degree distribution of the Preliminaries graph. Although I should be able to read the Prelims .gexf file with NetworkX, I was generating error after error, so I decided to simply use the Gephi scripting console to generate a .txt file with the degree of each node. This should have been fairly straightforward, but I found the formatting of the degree values in the .txt file to be extremely difficult to work with, so I wrote a fairly ugly script to clean up the data so I can process it and generate plots for degree distribution. Using the following script I was able to generate some plots:

import pylab as pl def degree_distribution(degree_list): """ set up a dictionary with degree as key and frequency as value """ degree_dict = {} for degree in node_list: degree_dict.setdefault(degree,0) degree_dict[degree] += 1 return degree_dict f = open('prelims_degree.txt','r') line = f.readline().split() f.close() # clean up the data clean_line = [] for degree in line: degree = list(degree) degree.pop() degree = ''.join([num for num in degree]) clean_line.append(degree) # make a list to be used in the in the- # degree_distribution function degree_list = [] for degree in clean_line: try: degree = int(degree) degree_list.append(degree) except ValueError: pass # compute and print basic graph stats # num. of nodes, egdes, and average degree avg_node_degree = float(sum(degree_list))/len(degree_list) print ("Number of Nodes : %i" % (len(degree_list))) print ("Number of Edges : %f" % (len(degree_list)*avg_node_degree/2)) print ("Avg. Node Degree: %f" % (avg_node_degree)) # set up a dict with degree frequency values degree_dict = degree_distribution(degree_list) # generate x,y values for degree dist. scatterplot x_list = [] y_list = [] for degree,frequency in degree_dict.items(): x_list.append(degree) y_list.append(frequency) # label the graph pl.title('Degree Distribution\nPrelims Graph') pl.xlabel('Degree') pl.ylabel('Frequency') # set limits for the axes pl.gca().set_xlim([-10,75]) pl.gca().set_ylim([-10,125]) # plot degree distribution pl.scatter(x_list,y_list) pl.show()

Which generates the following scatter-plot:

This plot looks like it also has the power law tail, although I can’t be sure, and as you can see is quite similar to the random scale free graph’s degree distribution. Alternatively, if we plot the Prelims data in a log-log plot we generate the following image:

Once again we see that the Preliminaries graph’s degree distribution is quite similar to the random scale free graph. This leads me to believe that the Preliminaries graph is indeed scale free, as many real world networks are. What does this mean? Preliminaries is a ‘real’ network? If anything it further validates the study of early cultural production using a network based methodology, as we can see that the network we have generated for this study does indeed share characteristics with modern day networks, and thus provides a comparative methodology for analyzing network evolution throughout human history.

all the code for this post is available in this gist