Generating plots and correlation coefficients with PostgreSQL and R

The major task of today was to generate some correlation coefficients showing that our approach to inferring data was consistent with established results. One of my colleagues generated a plot a few months (years?) ago that showed a respectable correlation of 0.86. Unfortunately, there are only 11 points on the plot, where there should be closer to 500 000. In the many presentations I’ve seen on this topic, that correlation slide is always questioned.

Fortunately, all of this data is available in our PostgreSQL database. Unfortunately, it was an adventure in several languages and programs that I tend to avoid: Perl, vi, and especially R.

It starts with the easy part. Databases are comfortable and familiar. I pulled out the values I thought I needed from the database by writing my command and piping the results to a file. Some of these grew quite large for text files: around 25MB, in some cases. Of course, this file size tends to make most text editors explode, so I had to work with them using Perl and vi. The file begins like this:

pair_id | comp_relation_id | weight | weight
---------+------------------+-----------------+----------
78720 |               11 |   1.26140273357 |  3.11573
78729 |               11 |   1.26140273357 |  4.37713
78739 |               11 |   1.15960269017 |  9.11115
...

Why they don’t just tab separate it is anyone’s guess. It would be as readable and more parse-able. Who really spends a lot of time looking at the raw PSQL output anyway?
Fun fact: for a 25MB text file, a find and replace in gedit takes about 10 minutes. Yes, I tried. Using regular expressions in Perl, it takes about 10 milliseconds, after you commit the first hour to wade through regular expression documentation.

perl -p -e 's/[ \t]+//g' myquery.results > myquery_nows.results

Once the whitespace was gone, a few commands in vi rid me of the header divider and gave my columns better names.

pair_id|comp_relation_id|inferred_weight|established_weight
78720|11|1.26140273357|3.11573
78729|11|1.26140273357|4.37713
78739|11|1.15960269017|9.11115
...

Finally I can stick it into R. I used Darren Wilkinson’s R tutorial and found this function corr to calculate my correlation. I think. Also, I found out that R² is actually the coefficient of determination, and I need to square whatever comes out of the corr function.

#First, install the necessary package
install.packages("boot")
library(boot)

#load the data and generate the plot
network = read.csv("myquery_nows.results", sep="|")
plot(network$established_weight, network$inferred_weight)

#calculate the correlation
xAndY = network[, c(3, 4)]
corr(xAndY)

Tah Daaah!

Well. That is pretty shitty. I’m really hoping that it’s just a case of not using the correct values from the database. Off to email my colleague to find out how he generated the original plot. But I’m pleased to announce that I am somewhat competent at the above tasks and will be able to perform them again once I’ve refined the input data.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s