Today's major task was to generate correlation coefficients showing that our approach to inferring data is consistent with established results. One of my colleagues generated a plot a few months (years?) ago that showed a respectable correlation of 0.86. Unfortunately, there are only 11 points on that plot, where there should be closer to 500,000. In the many presentations I've seen on this topic, that correlation slide is always questioned.
Fortunately, all of this data is available in our PostgreSQL database. Unfortunately, it was an adventure in several languages and programs that I tend to avoid: Perl, vi, and especially R.
It starts with the easy part: databases are comfortable and familiar. I pulled the values I thought I needed out of the database by writing my query in psql and piping the results to a file. Some of these files grew quite large for plain text: around 25 MB in some cases. A file that size tends to make most text editors explode, so I had to work with them using Perl and vi. The file begins like this:
pair_id | comp_relation_id | weight | weight
78720 | 11 | 1.26140273357 | 3.11573
78729 | 11 | 1.26140273357 | 4.37713
78739 | 11 | 1.15960269017 | 9.11115
Why they don't just tab-separate it is anyone's guess. It would be just as readable and far easier to parse. Who really spends much time looking at raw psql output anyway?
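In hindsight, psql can skip the pretty-printing entirely: its `-A` flag switches to unaligned output, `-F` sets the field separator, and `-t` drops the header and footer. For a file already on disk, the conversion is a one-liner anyway; here's a sketch using a couple of the rows above as sample input:

```shell
# Recreate a couple of rows of the psql output shown above.
printf '%s\n' \
  'pair_id | comp_relation_id | weight | weight' \
  '78720 | 11 | 1.26140273357 | 3.11573' \
  '78729 | 11 | 1.26140273357 | 4.37713' \
  > myquery.results

# Strip the padding spaces and turn the pipes into tabs.
# (psql itself can produce this directly: psql -A -F $'\t' -t ...)
tr -d ' ' < myquery.results | tr '|' '\t' > myquery.tsv
cat myquery.tsv
```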
Fun fact: for a 25 MB text file, a find-and-replace in gedit takes about 10 minutes. Yes, I tried. Using regular expressions in Perl, it takes about 10 milliseconds, once you've committed the first hour to wading through the regular expression documentation.
perl -p -e 's/[ \t]+//g' myquery.results > myquery_nows.results
Once the whitespace was gone, a few commands in vi rid me of the header divider and gave my columns better names.
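For the record, that vi step can also be done non-interactively with sed: delete the dashed divider line and rename the duplicate weight columns in the header. The new column names below are my own placeholders, not what's actually in our schema:

```shell
# Sample of the file after whitespace removal: header, divider, one data row.
printf '%s\n' \
  'pair_id|comp_relation_id|weight|weight' \
  '--------+----+----+----' \
  '78720|11|1.26140273357|3.11573' \
  > myquery_nows.results

# Delete the divider line and give the two weight columns distinct names
# (weight_inferred / weight_known are placeholder names).
sed -e '/^-/d' \
    -e '1s/weight|weight/weight_inferred|weight_known/' \
    myquery_nows.results > myquery_clean.results
cat myquery_clean.results
```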
Finally I can stick it into R. I used Darren Wilkinson's R tutorial and found the function cor to calculate my correlation. I think. I also found out that R² is actually the coefficient of determination, so I need to square whatever comes out of cor.
#load the data (cor and plot are in base R, so no package install is needed)
network = read.csv("myquery_nows.results", sep="|")
#plot one weight column against the other
xAndY = network[, c(3, 4)]
plot(xAndY)
#calculate the correlation, then square it for the coefficient of determination
r = cor(xAndY[, 1], xAndY[, 2])
r^2
Well. That is pretty shitty. I’m really hoping that it’s just a case of not using the correct values from the database. Off to email my colleague to find out how he generated the original plot. But I’m pleased to announce that I am somewhat competent at the above tasks and will be able to perform them again once I’ve refined the input data.