Becoming a full-blown Circos wizard, and earning the respect that comes with it, means climbing a significant learning curve. The Circos website offers many tutorials, which we will not try to replicate here. Instead, we started our own journey up the Circos mountain and documented our thoughts, feelings, and work along the way. Sometimes it's helpful to have things said in a different way.
The Goal
First, let's take a look at what we're aiming for:
[Figure: the finished Circos plot of the sonnets]
Some things to notice:
- You can add as many layers as you like! There is very little stopping you from piling in every last scrap of data you have in increasingly complex concentric circles.
- There are many different tools for displaying data including scatter plots, histograms and line graphs, links, labels, and more. These can be arranged in concentric circles or in overlapping layers.
- All of this flexibility and power comes with some overhead cost. It's kind of complicated to make one of these. This is a tool for the serious scientific graphic designer.
Formatting the Data
All the data and code for this project can be found in our Bitbucket repository. The text for each sonnet was obtained freely from Project Gutenberg.
To plot these sonnets, the first file you'll need is a "karyotype" file, which defines the structure of the circumference of your plot. In this case, each sonnet is assigned its own "chromosome". The structure of this file is pretty straightforward, with one line per chromosome: chr - ID Label Start End Color. Since each sonnet starts at 0, the End value is simply its length.
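As a rough illustration of the karyotype format, here is a minimal Python sketch that builds one line per sonnet. It is not taken from "parse_sonnets.py"; the function name, the sample lengths, the use of the sonnet number as the label, and the color names are all assumptions made for the example.

```python
def karyotype_lines(sonnet_lengths, colors=("red", "blue", "green")):
    """Build one karyotype line per sonnet: chr - ID LABEL START END COLOR.

    sonnet_lengths maps a sonnet number to its length in characters.
    The color names are placeholders and must exist in your Circos color setup.
    """
    lines = []
    for i, (number, length) in enumerate(sorted(sonnet_lengths.items())):
        chrom_id = f"son{number}"          # ID style follows the "son4" example
        color = colors[i % len(colors)]    # cycle through the placeholder colors
        lines.append(f"chr - {chrom_id} {number} 0 {length} {color}")
    return lines


# Hypothetical lengths, for illustration only
for line in karyotype_lines({1: 620, 2: 640}):
    print(line)
```

Each printed line can be written straight to the karyotype file that the configuration points at.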
The data files follow a similarly straightforward format. For example, to mark each occurrence of the word "love", the file has on each line: the chromosome ID (e.g. "son4"), the start and end position of the word (here both are set to the start), and the value of the marker at that position (set to zero throughout, since only the position matters here, not a measurement as in the next example).
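The occurrence file described above could be produced along these lines. This is a sketch under a few assumptions: positions are character offsets into the sonnet text, matching is case-insensitive and whole-word only, and the function name is hypothetical rather than something from our script.

```python
import re


def word_occurrences(chrom_id, text, word):
    """Return 'chrom start end value' lines, one per whole-word match.

    Start and end are both set to the match's starting character offset,
    and the value is fixed at 0 because only the position is plotted.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [f"{chrom_id} {m.start()} {m.start()} 0" for m in pattern.finditer(text)]


sample = "Love is not love which alters"
for line in word_occurrences("son4", sample, "love"):
    print(line)
```

Whole-word matching means that "loves" or "lovely" would not be counted; loosen the pattern if you want those included.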
The same is done for the frequency of the character "e" in each 100-character window. Each line gives the chromosome ID, the beginning and end of the window, and then the frequency (which corresponds to the histograms in the plot).
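The windowed frequency can be sketched as follows. The assumptions here are that "frequency" means the fraction of characters in the window that are "e" (a raw count would work just as well for the histogram), that upper and lower case are counted together, and that a short final window is kept rather than dropped. The function name is hypothetical.

```python
def char_frequency_windows(chrom_id, text, char="e", window=100):
    """Return 'chrom start end frequency' lines for consecutive windows.

    Frequency is the fraction of characters in the window equal to `char`,
    ignoring case. The last window may be shorter than `window`.
    """
    lines = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        end = start + len(chunk)
        freq = chunk.lower().count(char) / len(chunk)
        lines.append(f"{chrom_id} {start} {end} {freq:.3f}")
    return lines


# Synthetic text: 50 "e" characters followed by 100 "x" characters
for line in char_frequency_windows("son4", "e" * 50 + "x" * 100):
    print(line)
```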
The links are defined as pairs of lines in the file. Each line describes one end of a link: which chromosome that end falls on, and at what position within that chromosome.
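To make the paired-line layout concrete, here is a small sketch that writes two lines per link. It assumes the two-line Circos link format, in which both lines of a pair begin with a shared link ID; the ID naming scheme, the function name, and the sample coordinates are made up for the example.

```python
def link_lines(overlaps):
    """Return two lines per link, one for each end: 'link_id chrom start end'.

    `overlaps` is a list of pairs of (chrom_id, start, end) tuples.
    The shared link ID is what ties the two lines of a pair together.
    """
    lines = []
    for i, (end_a, end_b) in enumerate(overlaps, start=1):
        link_id = f"overlap{i}"  # hypothetical ID scheme
        for chrom_id, start, end in (end_a, end_b):
            lines.append(f"{link_id} {chrom_id} {start} {end}")
    return lines


# Hypothetical overlap between two sonnets
for line in link_lines([(("son4", 10, 25), ("son18", 40, 55))]):
    print(line)
```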
All of these files are created using the cleverly-named Python script "parse_sonnets.py".
Plotting
Now we get to Circos. First, you'll need to download Circos and install any dependencies. Circos is based on Perl, so you'll need to obtain a Perl distribution if you don't already have one. For Windows users, Strawberry Perl works fine. Unix users probably already have Perl installed.
The actual Circos executable is found in the bin directory and is aptly named "circos". Running it for the first time may reveal which Perl modules you're missing. CPAN has all of the needed dependencies for free, and they are easy to install.
The heart of the Circos plot is the configuration file, which is formatted much like a markup language, with settings grouped between opening and closing tags. The file that generated the figure above is named "sonnets.conf". There are many intricacies and loads of flexibility built into this plotting resource, so we will just highlight the essentials:
- Include the correct karyotype file to build the backbone of your plot.
- Create the colored blocks and corresponding labels. In this case, these indicate the relative length of each sonnet (they're all about the same length, but they're different, we promise!). These are defined between the "ideogram" tags.
- Add links between regions of identical text using the "links" tags. Notice that the "sonnet_overlaps.txt" file is included as the source of the data here.
- Finally, the histogram (between the colored blocks and the labels) and the scatter plot (the markers indicating the position of the words "love" and "hate") are defined within the "plots" tags. Notice that the files "sonnet_love_occurrences.txt", "sonnet_hate_occurrences.txt", and "sonnet_E_frequency.txt" are included as the data source for each individual plot.
Once everything is in place, including your configuration file, karyotype file, and data files, all you need to do is run circos and point it at the configuration file. On a Windows machine, you would run a command like this:
perl circosPath\circos -conf sonnets.conf