One of the most interesting talks I heard at the “Tools for Astronomical Big Data” Meeting I attended in March was by Carlos Scheidegger (University of Arizona) on “How do you look at a billion data points? Exploratory Visualization for Big Data.” It was my first detailed introduction to creating and querying nano cubes, a data structure that enables interactive visualizations of data sources with billions of elements. The idea behind them is that they are constructed to fit in a modern laptop’s main memory, even for hundreds of millions of data entries. In this way large and complex data sets can be visualized interactively, without the latency involved in reading from disk and without the need for high-performance computers for performing any processing.
Scheidegger and his team of Lauro Lins, Horace Ip and James Kloswoski have aweb page, nanocubes.net, which explains their project. Their Open Source code is available on GitHub at https://github.com/laurolins/nanocube. If you want a detailed technical description, read their paper Lauro Lins, James T. Klosowski, and Carlos Scheidegger. “Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. Visualization and Computer Graphics, IEEE Transactions on 19, no. 12 (2013): 2456-2465″ (free PDF). I found that their article “Into Main Memory: Nanocubes for Interactively Visualizing Billion-Point Data Sets” provides an excellent introduction to the topic.
There is also a youTube video explaining the principles and showing some fascinating visualizations:
So how does it all work? Compared with a data cube that stores the original data records, a nanocube has a reduced memory footprint because it stores aggregate counts rather than the original records (e.g. total number of tweets on Chicago per hour and so on), and uses a tree structure made up of cells for only those subsets that exist in the data set. But there is more to it than that. Scheidegger takes up the story. ” … we have merged a dwarf cube’s reduced memory footprint and fast querying with the ability to handle spatial and temporal dimensions. A nanocube retains the hierarchical tree structure of a dwarf cube but introduces a layering concept that includes three types of dimensions: spatial, categorical, and temporal, always traversed in that order. Each dimension has a start node and is subdivided into levels, the number of which depends on the data set. A query is a path through this tree.
The first dimension to be traversed is the spatial one, which in a nanocube can have up to 25 levels. The start point is the entire-world level followed by successively more granular levels all the way down to city blocks. (This is similar to Google Maps 17 spatial layers, with the highest resolution level translating, depending on the imagery, to about 10-30 meters.)
The nanocube uses the quadtree algorithm, a well-known spatial indexing technique in visualization to enable users to smoothly zoom in or out from one location to another. A quadtree divides the world into four quadrants, and each quadrant has four children, each of them likewise having four children also. In this way, the world is successively divided up into finer tiles, allowing users to visualize spatial data at arbitrary zoom levels.”
Here are some examples; see the web page for some interactive demos (from: Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, Lauro Lins, James T. Klosowski, and Carlos Scheidegger;IEEE Transactions on 19, no. 12 (2013): 2456-2465; http://nanocubes.net/assets/pdf/nanocubes_paper.pdf)