Tag Archives: Python

Math is the Path: Degree Distribution of the Prelims Network and Other Random Graphs

Over the past few weeks, I have been trying to learn the basics of statistical analysis of graphs, more specifically, complex networks. Here I must be honest, with high school level algebra as my only mathematical tool, trying to work through an article such as Albert and Barabási’s “Statistical mechanics of complex networks” is a daunting task. However, I have been able to get through the basic concepts and begin applying them to my work.

Something I found particularly interesting is the concept of Degree Distribution. As we already know, the degree of a node refers to the number of edges that are connected to that node, and not all of the nodes in a graph have the same degree. The distribution of degree in a network is “characterized by a distribution function P(k), which gives the probability that a randomly selected node has exactly k edges” (Albert, Barabási 2002). The degree distribution of a random graph, such as the variants of the Erdős–Rényi model, is a Poisson distribution, because “in a random graph the edges are placed randomly, the majority of nodes have approximately the same degree, close to the average degree <k> of the network” (Albert, Barabási 2002). However, is has been demonstrated that many real world graph’s degree distribution differ greatly from the Poisson distribution, instead, they demonstrate a power-law tail, P(k) ~ kγ ,and are called scale free networks (Albert, Barabási 2002).

Thinking about this, I became curious about the Preliminaries graph and its degree distribution. Preliminaries is a real-world network, its based on real information gathered from physical objects; however, the process of designing the schemas and relationships requires a fair amount of human manipulation. More about Preliminaries here, here, and here. So I decided to find the Prelims degree distribution and compare it with some random graphs. Now, as I said, my math skills are lacking a bit, something I plan on really working on over the next few years, but I do have what is takes to model  degree distribution using the Python modules NetworkX and Pylab. Here I will briefly describe my methodology(code) and the resulting plots.

First I decided to generate two random graphs and plot their degree distribution:

GNP Graph (Erdős–Rényi model):

This graph is generated by inputing the number of nodes in the graph and the probability that there is an edge between each pair of nodes. Because I wanted to imitate the node and edge count of the Prelims graph, I found a probability that would generate approximately 3464 edges. The formula for computing the probability is included in the code snippet used to generate the graph:


import pylab as pl
from networkx import *

### generate gnp_random_graph
### n = number of nodes
### m = expected number of edges

n = 1616
m = 3464

### p = probablity of edge creation
### m = p*n(n-1)
### 3464 = p*1616(1615)
### p = 0.0013272844312295006

p = 0.0013272844312295006
# generate graph
G = gnp_random_graph(n,p,directed=True)

# print basic stats
print ("Number of Nodes : %i" % (n))
print ("Number of Edges : %i" % (number_of_edges(G)))

# make a list of each node's degree
degree_list = list(G.degree().values())

# compute and print average node degree
print ("Avg. Node Degree: %f" %
 (float(sum(degree_list))/n))

# generate a list degree distribution
degree_hist = degree_histogram(G)
if len(degree_hist) < 15:
 print ("Degree Fequency List:")
 print ("Degree : # of Nodes")

# print the degree and number of nodes that have that degree
 for degree,number_of_nodes in enumerate(degree_hist):
   print ("%i : %i" % (degree,number_of_nodes))
else:
 print ("Degree Frequency List Too Long to Print")

# generate x,y values for degree dist. scatterplot
x_list = []
y_list = []
for degree,num_of_nodes in enumerate(degree_hist):
 if num_of_nodes > 0:
 x_list.append(degree)
 y_list.append(num_of_nodes)

# label the graph
pl.title('Degree Distribution\nGNP Graph')
pl.xlabel('Degree')
pl.ylabel('Frequency')

# plot degree distribution
pl.scatter(x_list,y_list)
pl.show()

This script results in the terminal output:

Number of Nodes : 1616
Number of Edges : 3605
Avg. Node Degree: 4.461634
Degree Fequency List:
Degree : # of Nodes
0 : 17
1 : 76
2 : 201
3 : 284
4 : 299
5 : 259
6 : 218
7 : 119
8 : 82
9 : 32
10 : 19
11 : 7
12 : 2
13 : 1

And the following scatter plot:

GNP_graph

As you can see this resembles a Poisson distribution, like this one taken from the WolframAlpha website:

Poisson

Scale Free Random Graph:

Next I generated a random scale free graph. The script I used was very similar to the previous script, except I used a different graph generator with only the node count as a parameter and I set the tighter limits for the x and y axes:


n = 1616
G = scale_free_graph(n)

# set limits for the axes
pl.gca().set_xlim([-10,70])
pl.gca().set_ylim([-10,120])

This script results in the following terminal output:


Number of Nodes : 1616
Number of Edges : 3428
Avg. Node Degree: 4.242574
Degree Frequency Too Long to Print

And the scatter plot:

TheScaleFree

This plot resembles a power law tail, such as this one from WolframAlpha:

PowerLawWol

Scale free distributions are commonly plotted in using log-log plots, such as those used by Albert and Barabási in the previously mentioned article. To produce a log-log plot, you can simply change the Pylab scale to log, also for better visualization change the axes limits:


# set limits for the axes
pl.gca().set_ylim([0.9,1000])
pl.gca().set_xlim([0.9,1000])

# log-log plot
pl.gca().set_xscale("log")
pl.gca().set_yscale("log")

The random scale free graph plotted in log-log looks like this:

NewScaleLog

So, now that we have seen what a PNG random graph and a scale free random graph degree distribution looks like, let’s take a look at the degree distribution of the Preliminaries graph. Although I should be able to read the Prelims .gexf file with NetworkX, I was generating error after error, so I decided to simply use the Gephi scripting console to generate a .txt file with the degree of each node. This should have been fairly straightforward, but I found the formatting of the degree values in the .txt file to be extremely difficult to work with, so I wrote a fairly ugly script to clean up the data so I can process it and generate plots for degree distribution. Using the following script I was able to generate some plots:

import pylab as pl

def degree_distribution(degree_list):
 """
 set up a dictionary with degree as key
 and frequency as value
 """
 degree_dict = {}
 for degree in node_list:
   degree_dict.setdefault(degree,0)
   degree_dict[degree] += 1
 return degree_dict

f = open('prelims_degree.txt','r')
line = f.readline().split()
f.close()

# clean up the data
clean_line = []
for degree in line:
 degree = list(degree)
 degree.pop()
 degree = ''.join([num for num in degree])
 clean_line.append(degree)

# make a list to be used in the in the-
# degree_distribution function
degree_list = []
for degree in clean_line:
 try:
   degree = int(degree)
   degree_list.append(degree)
 except ValueError:
   pass

# compute and print basic graph stats
# num. of nodes, egdes, and average degree
avg_node_degree = float(sum(degree_list))/len(degree_list)
print ("Number of Nodes : %i" % (len(degree_list)))
print ("Number of Edges : %f" %
 (len(degree_list)*avg_node_degree/2))
print ("Avg. Node Degree: %f" %
 (avg_node_degree))

# set up a dict with degree frequency values
degree_dict = degree_distribution(degree_list)

# generate x,y values for degree dist. scatterplot
x_list = []
y_list = []
for degree,frequency in degree_dict.items():
 x_list.append(degree)
 y_list.append(frequency)

# label the graph
pl.title('Degree Distribution\nPrelims Graph')
pl.xlabel('Degree')
pl.ylabel('Frequency')

# set limits for the axes
pl.gca().set_xlim([-10,75])
pl.gca().set_ylim([-10,125])

# plot degree distribution
pl.scatter(x_list,y_list)
pl.show()

Which generates the following scatter-plot:

NewPrelimLin

This plot looks like it also has the power law tail, although I can’t be sure, and as you can see is quite similar to the random scale free graph’s degree distribution. Alternatively, if we plot the Prelims data in a log-log plot we generate the following image:

PrelimLog

Once again we see that the Preliminaries graph’s degree distribution is quite similar to the random scale free graph. This leads me to believe that the Preliminaries graph is indeed scale free, as many real world networks are. What does this mean? Preliminaries is a ‘real’ network? If anything it further validates the study of early cultural production using a network based methodology, as we can see that the network we have generated for this study does indeed share characteristics with modern day networks, and thus provides a comparative methodology for analyzing network evolution throughout human history.

@dbrownbeta

all the code for this post is available in this gist

Leave a Comment

Filed under Uncategorized

Winter Break @CulturePlex

Well it’s already the 20th of December, classes have been over for two weeks and things are a bit more relaxed around here at Western. Campus is strangely deserted and it’s hard to get that 4 p.m. cup of coffee that you know every grad student requires for survival. Everyone is closing up shop, mopping the floor and stepping out early. It’s that time of year.

What is dbrownbeta doing for vacations…drinking margaritas in Cabo…or perhaps a bit of SCUBA in Roatán—maybe mounting a quick roadtrip to Utah?

Not this year folks. Just sitting tight in Canada waiting for the snow.

I am taking advantage of this time to get a few things done. When school is in session it’s tough to get much real work done with the nonstop itinerary of classes and meetings and readings and speakers. Your job is grad school and grad school is your life. Does that make sense? So your job is really to live your life of a grad student. Confusing? Yes it confuses me as well.

Moving on to more technical and less ridiculous topics, I want to talk a bit about what I am doing this Winter Break. Let’s do this.

Finishing Up My CourseWork

Last weekend I finished my project for the class the Máquina cultural. Although the essay wasn’t my finest work, the models I made turned out quite nicely, and I got a chance to experiment with Gephi’s Geo Layout. And I got to add a new function to my Gephi/Python library.

The graph consisted of the metadata of a corpus of literary and critical texts laid out in Gephi. The majority of nodes were just standard nodes with normal attributes, however, the nodes that represented geographical locations (cities) were arranged based on their lat/long attributes using Gephi’s Geo Layout.

These geonodes were then fixed in place using my new fix_set function in combination with other functions from the Gephi/Python library. Then the other nodes were arranged around the geonodes using ForceAtlas 2.

Pretty neat huh?

CulturePlex Projects

I am also working on a few CulturePlex projects this Winter Break. I recently received the chance to help out with the Sylva project. The lab is getting ready to officially release Sylva to the public, and because one of the priorities of Sylva is ease of use, we want to provide comprehensive documentation. I am helping to develop the content of this documentation. We are working on creating three types of documentation: a user guide that describes all of the features of Sylva, a step by step tutorial to creating your first graph with Sylva, and a help menu with FAQs, solutions, etc.

We are also beginning work on a new period of the Preliminaries project. In this phase we will focus on the time period of 1643-1661 during the administration of Luis de Haro. We are particularly interested in this period because this is when Pedro Calderón de la Barca began to be published prolifically. In this case, the graph will be used not only for general network analysis, but also as a supplement to studies on the contemporary reception of Calderón’s work. We have just barely begun to assemble the first editions list for this phase, but we plan to have it finished before May.

Personal Projects

I have three personal projects this winter break: learn HTML/CSS, learn JavaScript for use in web pages and the Google Maps API, and build a personal web page. I started learning HTML last Sunday evening, and my colleague Roberto showed me on the Bootstrap on Wednesday. Bit by bit, my website is coming along:

 

It’s called xitōmatl and it will provide links to my social networking sites, descriptions of my projects (personal and CulturePlex) with their associated image galleries, my personal profile and CV, etc. Also, I plan on creating a page that focuses specifically on the research of New Spain. Here I will provide a variety of content supplemented with links to digitized rare New Spanish books, various websites useful in the study of New Spain, and a few resources for learning Classical Nahuatl (another project coming soon). xitōmatl is available at this Gist if you want to take a look. It’s still a bit sloppy (a bunch of style elements that need to be is a CSS), but you get the idea.

That’s it for today…time to get back to work.

Happy Holidays

@dbrownbeta

 

 

Leave a Comment

Filed under Uncategorized

Traversing the Graph: the Coded Text of the Prelims Project

Hello and welcome back! Two weeks have passed and it is time go back to the Preliminaries Project graph and take a fresh look at what  lies hidden within its multicolored mini-universe of nodes and edges and digitally encoded information. Sound daunting? It is. Coming from a primarily humanities background, I still retain a slight fear of the digital world; it intrigues me and at the same time  it frightens me. Doesn’t look like a book…Doesn’t smell like a book…Hey that’s not a book! Humanities, welcome to the 21st century.

Although my Gephi generated friend and  constant companion the Preliminaries graph is most definitely NOT a book, it does have some similarities with something like–I don’t know–maybe a historical text written in Ancient Greek: it contains a lot of information if you know how to read it. For a guy like me, reading the Preliminaries graph is a ground-up process, I have to decode as I go. Obviously there are a variety of excellent texts and websites to learn about social network analysis; however, it is important to recognize that each data set is unique and to approach it as such. This blog post will share the experience of my first attempt to crack the code of the Preliminaries graph.

A quick note about functions:

From now on I will include any code that I write as a hyperlink to a Gist in which the code is stored. Feel free to download them, play with them, change them, or leave comments. Remember I am a Python (programming) rookie so please be kind!

Ever since I started looking at these kinds of graph I have been particularly interested in the measure Betweenness Centrality because I feel that it is an effective way of measuring a node’s importance in a graph. It is a measure of the number of Shortest Paths that pass through the node, and is an excellent measure of the load placed upon the node in terms of mainting close conectivity between the various communities of a network. I decided to use this measure as a starting point (fig.1) and everything else just started falling in place.

Figure 1 Prelims Graph in Black and White

The above figure is the Preliminaries graph rendered with the OpenOrd algorithm with node size adjusted for betweenness centrality and a size range from 5-40. Obviously, Lope de Vega dominates the graph, but other nodes also stand out: the Count of Lemos, the Duke of Lerma, Felipe Bernardo de Castillo, Gonzalo de Céspedes y Meneses, and Miguel de Cervantes. While each of these other important nodes will recieve further  investigation, today I will focus on the results of a line of inquiry that starts, but does not finish, with Lope de Vega.

First I decided to take a look at Lope’s Publication Network(fig.2). In the previous blog I introduced the concept of the publication network as an important way to understand the social circles associated with literary production. In the Prelims data set a publication network is an ego or neighbors network that extends to four degrees of seperation. This range includes the editors, censors, and other individuals important in the formal process of publication. The publication network also includes any other individuals that are more directly connected to the author: personal realtionships, patronage, dedications in the form of poetry or letters etc. Using the new find_neighbors function I wrote to identify neighbors networks at n degrees of seperation, I determined that Lope’s publication network consists of 1083 nodes or 67% of the nodes in the graph. Ok, he was well connected, that is something we can assume based on the extremely prolific nature of his literary production…so what can the Prelims graph tell us? Well for starters, it can tell who wasn’t in his publication network.

Fig. 2 Lope de Vega’s publication network

Lope de Vega.color = red

In order to do this, I removed the subset of nodes representing Lope’s publication network from the other nodes that make up the graph, then using another new function filter_by_type combined with return_label I was able to return a list of the names of all of the people who are not in Lope’s publication network. A quick review of this list produced some interesting results: I found several authors based in Spain including Gonzalo de Céspedes y Meneses and El Inca Garcilaso, and two authors active in Peru, Diego Dávalos y Figueroa and Pedro de Oña. While a quick look at both Céspedes and El Inca produced interesting results, the two Peruvian authors–actually neither was really Peruvian, Oña was born in Chile and Dávalos in Spain, but they both published books in Lima–attracted my attention. In this period social circles were indeed highly influenced by geography, and it is logical that these authors find themselves at the periphary of a network centered geographically in Madrid; however, I was curious about how far removed they actually are from Lope’s network and furthermore,  how their situation compares to authors working in New Spain (modern Mexico). Using set_intersect to indentify intersecting portions of the various neighbors networks, I found that both Pedro de Oña and Diego Dávalos y Figueroa are connected to Lope’s network at 3 degrees of seperation through their dedications to the Viceroy of Peru Luis de Velasco y Castilla, and at four degrees through Juan de Zúniga, Diego de Ojeda, and the Order of Santiago (fig. 3 and fig. 4)

Fig. 3 Diego Dávalos y Figueroa and Lope de Vega

Lope de Vega.color = red

Diego Dávalos y Figueroa.color = yellow

set_intersect.color = blue

 Fig. 4 Pedro de Oña’s and Lope de Vega

Lope de Vega.color = red

Pedro de Oña.color = yellow

set_intersect.color = blue

As previously noted, the relative lack of connections with the Madrid circle could be inferred based on geographic concerns, so in order to put this in perspective, I decided to take a look at several writers active in New Spain, which although located in the Americas was considerably closer to the center of Spanish literary production. I chose Bernardo de Balbuena an author and politition, who published in Mexico City and Spain, and Juan de Torquemada, a Franciscan chronicler who published in Madrid, both of whom were quite well connected in Spain as shown by the following images(fig.5 and fig. 6).  It is important to note here that there is considerable variation between the networks of the two authors associated with Mexico, an avenue that I hope to explore in the future.

 Figure 5 Juan de Torquemada y Lope de Vega

Lope de Vega.color = red

Juan de Torquemada.color = yellow

set_intersect.color = blue

Figure 6 Bernardo de Balbuena and Lope de Vega

Lope de Vega.color = red

Juan de Torquemada.color = yellow

set_intersect.color = blue

In order to compare the “Mexican” authors with the “Peruvian” authors, I combined the four social networks based on geographic constraints: however, I found that at 4 degrees of seperation there was no direct overlap, so I upped the parameter in find_neighbors to 5 degrees of seperation and produced the following image in which the “Mexican” network is red, the “Peruvian” is yellow, and the overlap is blues(fig.7).

Figure 7 American Connections

“Mexican” Authors.color = red

“Peruvian” Authors.color = yellow

set_intersect.color = blue

As we can see here, even at five degrees of seperation there are few overlaps between the networks. In a way this is a good thing, because it allows us to focus in easily on important points. Coming from the Peruvian side, we can see that the Order of Santiango generates an important connection originating from the Viceroy of Peru, Luis de Velasco and connecting with several individuals who are shared by both social circles(fig. 8).

Figure 8 The Order of Santiago

Also, we can see a connection through Juan de Zúñiga that originates with the Erratas of Miscelánea Austral by Dávalos y Figueroa(fig.9).

Figure 9 Erratas of Miscelánea Austral

Following this connection a bit further, we arrive at what I consider to be very interesting territory: the importance of the House of Zúñiga in the Americas(fig.10).

Figure 10 The House of Zúñiga:Peru

It is well known that the house of Zúniga was powerful in both Spain and the Americas, and also that certain members of this house were important patrons of the arts and literature; however, I don’t know if their role in transatlantic literary production has been adequetly explored–my next mission will be to search for studies that focus on this noble family. Let’s follow this path through the graph a bit more and see where it goes (fig.11).

Figure 11 House of Zúñiga:Mexico

Here we can see the obvious importance of the Zúñiga’s in Mexico during this period (an Archbishop and a Viceroy). This illustrates not only the political importance of the house of Zúñiga in New Spain, but also the importance of  political figures/nobility in publication and how the members of one house can spread their cultural influence throughout geographic space. To take this concept one step further, let’s follow the Zúñigas back the Spain(fig. 12).

Figure 12 House of Zúñiga: Spain

Here we find Alonso López de Zúñiga y Pérez de Guzmán, the Duke of Béjar, and–not surpisingly for all the quixotephiles out there–the first part of Don Quixote. It turns out that American authors were not the only ones soliciting support from the House of Zúñiga: Miguel de Cervantes dedicated part 1 of Don Quixote to the famous Duque de Béjar…hmmm. Interesting? Could be, at the very least worthy of further inquiry.

Here I have conducted a very brief first look at the Prelims graph. Although all of the above connections deserve more attention and some might even merit some real-deal academic concern, it is too early to say with any certainty. For the time being I will continue trying to crack the code of the Prelims graph and faithfully provide any results I can find. Two steps: 1- Hit the books, and then 2- do some old-fashioned number crunching using the Python software package NetworkX and produces some custom .gexf files to use with Gephi…wait did I say old fashioned? HeHe…

Until next time: Saludos

@dbrownbeta

P.S. Thanks versae for all the constructive criticism!

4 Comments

Filed under Uncategorized

Preliminary Analysis of the Preliminaries Project

Hello! Welcome back to the Preliminaries Project blog!

This week, as promised, I would like to give you all a bit more information about the project including the current status of the Preliminaries database and the methodology used in contructing the database. However, the primary focus of this entry will be on various techniques used to analyze the Preliminaries graph, due to the fact that I have spent the last few days trying to figure out how to do this. But first let me give you a bit of background about me.

My educational background is primarily literature and linguistics. I did my undergrad work at the University of Oregon, where I studied Spanish with a fair bit of Linguistics as a secondary focus. Last year, I started my graduate work as masters student here at Western studying Hispanic Literature. My first contact with using technological means to study literary topics came last spring in in Professor Suárez’s class about the Hispanic Baroque. As a class project we started building an early version of the Preliminaries database in Sylva. I ended up doing my final project on the social networks involved in the production of early editions of Don Quixote, and I haven’t looked back. Last summer, I began officially working here at the Culturplex Lab on the Preliminaries Project. So to make a long story short I am a rookie when it comes to digital humanities, computer modeling, and programming. This fall I have been taking a class that focuses on Python, a high level programming language that is popular amongst scientists of all types, and also a Coursera course about social network analysis. I am just learning how to use this technology, but I hope I can share some of this learning process with you and in the end maybe everyone will benefit. Okay enough about me…let’s get back to the project.

As I mentioned before the Prelims Project is ongoing, and although it isn’t 100% complete, the database is sufficiently devoloped to begin doing a bit of analysis. Currently the first editions list (Duque de Lerma, 1598-1618) constists of 330 editions, out of which I have been able to obtain 228 scanned copies of preliminary sections, approximately %70, which isn’t bad considering that these texts were published 400 years ago. Of these scans, around 120 have been entered into the database, producing a graph with 1612 nodes and 3472 relationships. Rendered in Gephi using the built in YifanHu’s Multlevel algorithm, colored for modularity, and sized for betweenness centrality, the graph looks like this:

This visualization is nice because you can see the general structure of the graph and the coloring gives you a good idea of the communities within the the network as a whole. However, the amount of information presented here is overwhelming, so I have been looking for some ways to control the visualization and the information on which it is based to allow for some detailed comparative analysis.

One of the nice features of Gephi is that it has a variety of built in filters to allow the user to limit the information that appears in the graph. Something that we are interested in regarding the Prelims Project is the community structures within the graph. Let’s use a filter to see the modules of various famous writers of the period:

First Miguel de Cervantes, author of Don Quixote

Then Lope de Vega, author of the Comedias

As you may imagine, this type of filtration is crucial for analysis. It allows us pick apart the graph and study the elements in a controlled and manageable fashion.

There is another type of subset within a graph called the Ego Network. These are based on direct conections between a node and its neighbors. Although Gephi also has an filter for Ego Networks, I encountered a small problem here: Gephi only allows filtering for up to three degrees of seperation. This presents a challenge with the Preliminaries graph due to the schema design for the database.

In order to establish a connection between the author and an edition there are two steps: Author->Obra, Obra->Edition. This is due to organizational/editorial concerns that I hope to address in the next blog. Furthermore, for the author to be related to the people involved in the approval, licensing, and publication of an edition, two more steps are required e.g Edition->Approval, Approval->Censor. Therefore to establish what I call a Publication Network, somewhat equivalent to an Ego Network, I need to be able to find neighbors for up to four degrees of seperation. Thankfully, Gephi includes a scripting console based on the Python programming language. Using functions based on the following patterns I am able to mimic the filtering abilities of Gephi and create a way to isolate and compare subsets of the graph in order to generate these Publication Networks:

 

 

 

 

 

 

 

 

 

It is also important to note that it is necessary to combine the subsets generated by these functions, which I have done using the following function “completelist”, and then to make sure there are no stray ‘NoneType’s or duplicates, which I have done with  “masterlist”:

Then, using the subsets generated here I can color and size the Publication Networks using the following functions:

I can also find the intersections of various Publication Networks using the following function:

Thus, using a very basic knowledge of Python I am able to manipulate the graph and compare any subsets of nodes that I would like.

An applied example of these functions would be the following:

Publication network of Bernardo de Balbuena, author of Grandeza mexicana: Red

Publication network of Juan de Torquemada, author of Monarquía Indian: Blue

Their intersecting Publication Networks: Yellow

That’s it for today folks. Over the next week and a half I hope to generate some definite results to talk about and some more refined visualizations using my newfound techie skills.

Hope to see you next time around. For more information you can always email me at: dbrow52@uwo.ca or follow me on twitter @dbrownbeta

 

 

 

8 Comments

Filed under Uncategorized