Wednesday, July 1, 2015

Circos and Shakespeare

You know those cool circular plots that manage to display every conceivable piece of data and still look classy? They've been displayed on the cover of more high-profile journals than you can shake a stick at... you know the ones I mean? Circos plots. If there is a social hierarchy among data visualization tools, Circos is arguably in the aristocracy. If you can produce a Circos plot from scratch with your own data, you are bound to get compliments (and acknowledgements from your audience such as "That's a Circos plot, isn't it? Nice.").

Earning the respect that comes with being a full-blown Circos wizard means climbing a significant learning curve. The Circos website offers many tutorials, which we will not try to replicate here. Instead, we started our own journey up the Circos mountain and documented our thoughts, feelings, and work. Sometimes it's helpful to have things said in a different way.

The Goal

First thing, let's take a look at what we're aiming for:

Around the circumference of the plot are 154 of William Shakespeare's sonnets. The block colors are assigned randomly. Red markers indicate where the word "love" is used, while blue markers indicate where the word "hate" is found. Black lines connect identical fragments of text that are 12 characters or longer (excluding white space). Between the colored blocks and the labels are small histograms indicating the frequency of the letter "e" in 100-character windows of text.

Some things to notice:
  1. You can add as many layers as you like! There is very little stopping you from piling in every last scrap of data you have in increasingly complex concentric circles.
  2. There are many different tools for displaying data including scatter plots, histograms and line graphs, links, labels, and more. These can be arranged in concentric circles or in overlapping layers.
  3. All of this flexibility and power comes with some overhead cost. It's kind of complicated to make one of these. This is a tool for the serious scientific graphic designer.

Formatting the Data

All the data and code to accomplish this project can be found in our Bitbucket repository.

The text for each sonnet was obtained freely from Project Gutenberg. In order to plot these sonnets, the first file you'll need is a "karyotype" file. The karyotype file defines the structure of the circumference of your plot. In this case, each sonnet is assigned to be a unique "chromosome". The structure of this file is pretty straightforward, with one line per chromosome in the form: chr - ID Label Start End Color. Because every sonnet starts at position 0, the end value is simply the sonnet's length.

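To make the karyotype format concrete, here is a minimal Python sketch of how such lines could be generated. It is not the code from parse_sonnets.py: the function name karyotype_lines, the short color list, and the choice to cycle through colors (the plot above picks them randomly) are assumptions made for illustration, as is measuring length in raw characters.

```python
def karyotype_lines(sonnets, colors=("red", "green", "blue", "orange", "purple")):
    """Return one karyotype line per sonnet.

    `sonnets` maps a sonnet number to its text. Each line follows the
    pattern: chr - ID Label Start End Color.
    """
    lines = []
    for i, (number, text) in enumerate(sorted(sonnets.items())):
        chrom_id = f"son{number}"          # hypothetical ID scheme, e.g. "son4"
        color = colors[i % len(colors)]    # assumption: cycle instead of random
        lines.append(f"chr - {chrom_id} {number} 0 {len(text)} {color}")
    return lines


if __name__ == "__main__":
    sample = {
        1: "From fairest creatures we desire increase,",
        2: "When forty winters shall besiege thy brow,",
    }
    print("\n".join(karyotype_lines(sample)))
```

Swapping the color cycle for a seeded random choice would reproduce the "random but repeatable" look without changing the file format.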
Next, the data files take a similarly straightforward format. For example, to indicate each occurrence of the word "love", the file has on each line: the chromosome ID (e.g. "son4"), the start and end position of the word (here we set both to the word's start), and the value of the marker at that position (which we set to zero everywhere, since only the position matters here, not a measurement as in the next example).
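As a rough illustration of that layout, the sketch below finds a word in one sonnet and emits one line per hit. The function name word_occurrence_lines is hypothetical, and matching whole words case-insensitively is an assumption; the original script may count matches differently.

```python
import re


def word_occurrence_lines(chrom_id, text, word):
    """Return 'ID start end value' lines for each whole-word match of `word`.

    Start and end are both set to the match start, and the value is 0,
    because only the position is plotted.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [f"{chrom_id} {m.start()} {m.start()} 0" for m in pattern.finditer(text)]


if __name__ == "__main__":
    line = "Love is not love which alters when it alteration finds"
    print("\n".join(word_occurrence_lines("son116", line, "love")))
```

Running the same function with "hate" would produce the second marker file.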

The same thing is done for the frequency of the character "e" in each 100-character window. Each line gives the chromosome ID, the beginning and end of the window, and then the frequency (which sets the height of the histogram bars in the plot).
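A windowed count like this could be sketched as follows. The function name letter_frequency_lines is hypothetical, and two details are assumptions: the windows do not overlap, and the frequency is reported as a fraction of the window (a raw count would work equally well for a histogram).

```python
def letter_frequency_lines(chrom_id, text, letter="e", window=100):
    """Return 'ID start end frequency' lines, one per non-overlapping window."""
    lines = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        end = start + len(chunk)  # the last window may be shorter than `window`
        frequency = chunk.lower().count(letter) / len(chunk)
        lines.append(f"{chrom_id} {start} {end} {frequency:.3f}")
    return lines


if __name__ == "__main__":
    sample = "Shall I compare thee to a summer's day? " * 4
    print("\n".join(letter_frequency_lines("son18", sample)))
```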

The links are defined using a format built from pairs of lines. Each line describes one end of a link: which chromosome that end sits on, and at what position within that chromosome. The two lines of a pair together define a single link.
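Here is one way the shared fragments and their paired lines might be produced. This is a simplified sketch, not the algorithm from parse_sonnets.py: it indexes every 12-character run of one sonnet (white space removed), looks those runs up in the other sonnet, and extends each hit as far as the texts agree. The function names, the "linkN" identifiers, and the exact column order (ID, chromosome, start, end) are assumptions.

```python
from itertools import combinations


def strip_whitespace(text):
    """Return the text without white space, plus each kept character's original index."""
    kept = [(i, ch) for i, ch in enumerate(text) if not ch.isspace()]
    return "".join(ch for _, ch in kept), [i for i, _ in kept]


def shared_fragments(text_a, text_b, min_len=12):
    """Yield (start_a, end_a, start_b, end_b) for fragments found in both texts."""
    a, pos_a = strip_whitespace(text_a)
    b, pos_b = strip_whitespace(text_b)
    index = {}
    for i in range(len(a) - min_len + 1):
        index.setdefault(a[i:i + min_len], i)  # keep only the first occurrence
    j = 0
    while j <= len(b) - min_len:
        i = index.get(b[j:j + min_len])
        if i is None:
            j += 1
            continue
        length = min_len
        # Extend the match while both texts keep agreeing.
        while i + length < len(a) and j + length < len(b) and a[i + length] == b[j + length]:
            length += 1
        yield pos_a[i], pos_a[i + length - 1] + 1, pos_b[j], pos_b[j + length - 1] + 1
        j += length


def link_lines(sonnets, min_len=12):
    """Return paired lines, two per link, for every pair of sonnets."""
    lines = []
    link_number = 0
    for x, y in combinations(sorted(sonnets), 2):
        for sa, ea, sb, eb in shared_fragments(sonnets[x], sonnets[y], min_len):
            link_number += 1
            lines.append(f"link{link_number} son{x} {sa} {ea}")
            lines.append(f"link{link_number} son{y} {sb} {eb}")
    return lines


if __name__ == "__main__":
    demo = {1: "xx my love is strong", 2: "my love is strong yy"}
    print("\n".join(link_lines(demo)))
```

Positions are mapped back to the original text, so the link ends line up with the same coordinates used in the karyotype and marker files.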


All of these files are created using the cleverly-named Python script "parse_sonnets.py".

Plotting

Now we get to Circos. First, you'll need to download Circos and install any dependencies. Circos is based on Perl, so you'll need to obtain a Perl distribution if you don't already have one. For Windows users, Strawberry Perl works fine. Unix users probably already have Perl installed.

The actual Circos executable is found in the bin directory and is aptly named "circos". Attempting to run this the first time may unearth the Perl modules you're missing. CPAN has all of the needed dependencies for free, and they are easy to install.

The heart of the Circos plot is the configuration file, which uses an Apache-style syntax: "key = value" settings grouped inside tagged blocks. Our file that generated the figure above is named "sonnets.conf". There are many intricacies and loads of flexibility built into this plotting resource, so we will just highlight the essentials:

Include the correct karyotype file to build the backbone of your plot.

Create the colored blocks and corresponding labels. In this case, these indicate the relative length of each sonnet (they're all about the same length, but they're different, we promise!). These are defined between the "ideogram" tags.

Add links between regions of identical text using the "links" tags. Notice that the "sonnet_overlaps.txt" file is included as the source of the data here.

Finally, the histogram (between the colored blocks and the labels) and the scatter plot (the markers indicating the position of the words "love" and "hate") are defined within the "plots" tags. Notice that the files "sonnet_love_occurrences.txt", "sonnet_hate_occurrences.txt", and "sonnet_E_frequency.txt" are included as the data source for each individual plot.

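Putting those four pieces together, a stripped-down configuration might look something like the text embedded in the Python sketch below. This is not our actual sonnets.conf: the karyotype file name "sonnets_karyotype.txt", the output name "sonnets_sketch.conf", and every radius, size, and color are placeholder assumptions that would need tuning. The three include lines at the end pull in the standard settings that ship with Circos.

```python
from pathlib import Path

# Hypothetical minimal configuration. All numeric values are placeholders.
CONFIG = """\
karyotype = sonnets_karyotype.txt

<ideogram>
<spacing>
default = 0.002r
</spacing>
radius = 0.80r
thickness = 30p
fill = yes
show_label = yes
label_font = default
label_radius = 1r + 75p
label_size = 18
</ideogram>

<links>
<link>
file = sonnet_overlaps.txt
radius = 0.95r
bezier_radius = 0r
color = black
thickness = 2
</link>
</links>

<plots>
<plot>
type = histogram
file = sonnet_E_frequency.txt
r0 = 1.075r
r1 = 1.125r
fill_color = grey
</plot>
<plot>
type = scatter
file = sonnet_love_occurrences.txt
r0 = 0.96r
r1 = 0.99r
glyph = circle
glyph_size = 8
color = red
</plot>
<plot>
type = scatter
file = sonnet_hate_occurrences.txt
r0 = 0.96r
r1 = 0.99r
glyph = circle
glyph_size = 8
color = blue
</plot>
</plots>

<image>
<<include etc/image.conf>>
</image>
<<include etc/colors_fonts_patterns.conf>>
<<include etc/housekeeping.conf>>
"""


def write_config(path="sonnets_sketch.conf"):
    """Write the sketch configuration to `path` and return the path."""
    Path(path).write_text(CONFIG)
    return path


if __name__ == "__main__":
    print(CONFIG)
```

Keeping the configuration in one place like this makes it easy to regenerate alongside the data files, but editing the .conf file by hand works just as well.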
Once everything is in place, including your configuration file, karyotype file, and data files, all you need to do is run circos and point it at the configuration file. On a Windows machine you would run a command like this:

perl circosPath\circos -conf sonnets.conf

Final Notes

As we mentioned, there is a whole universe of possibilities and custom solutions. Go play around with this code, follow the Circos tutorials, and find examples of previous work to inspire you. As an ending note, the Circos software outputs the plot as both a PNG and an SVG file. This means that you can further customize these plots using GIMP or Inkscape. You'll be garnering glory and honor at the next poster session in no time!

