Next Previous Up Contents
Next: Examples
Up: Invoking TOPCAT
Previous: JDBC Configuration

9.4 Tips for Large Tables

Considerable effort has gone into making TOPCAT capable of dealing with large datasets. In particular it does not in general have to read entire files into memory in order to do its work, so it's not restricted to using files which fit into the java virtual machine's 'heap memory' or into the physical memory of the machine. As a rule of thumb, the program will work with tables up to about a million rows at a reasonable speed; the number of columns is less of an issue (though see below concerning performance). However, the way you invoke the program affects how well it can cope with large tables; you may in some circumstances get a message that TOPCAT has run out of memory (either a popup or a terse "OutOfMemoryError" report on the console), and there are some things you can do about this:

Use the -disk flag
By default, TOPCAT reads the data from tables in most file formats (everything except FITS binary tables) into memory when it loads them. By specifying the -disk flag on the command line, you can tell it to use temporary files for storing this data instead. This makes it much less likely to run out of memory. You can achieve the same effect by adding the line startable.storage=disk in the .starjava.properties in your home directory. This is the best thing to do if you're seeing an out of memory error when loading tables. See Section 9.1, Section 9.2.3.
Increase Java's heap memory
When a Java program runs, it has a fixed maximum amount of memory that it will use; Java does not really make use of virtual memory. The default maximum is typically 64Mb. You can increase this by using the -Xmx flag, followed by the maximum heap memory, for instance "topcat -Xmx256M" or "java -Xmx256M -jar topcat-full.jar". Don't forget the "M" to indicate megabytes. It's generally reasonable to increase this value up to nearly the amount of free physical memory in your machine if you need to (taking account of the needs of other processes running at the same time) but attempting any more will usually result in abysmal performance. See Section 9.2.2.
Use FITS files
Because of the way that FITS files are organised, TOPCAT is able to load tables which are stored as uncompressed FITS binary tables on disk almost instantly regardless of their size (in this case loading just reads the metadata, and any data elements are only read if and when they are required). So if you have a large file stored in VOTable or ASCII-based form which you use often and takes a long time to load, you may wish to convert it to FITS format once for subsequent use. You can do this either by loading it into TOPCAT once and saving it as FITS, or using the separate command-line package STILTS. Note that the "fits-plus" variant which TOPCAT uses by default retains all the metadata that would be stored in a corresponding VOTable, so you won't normally lose information by doing this (see Section 4.1.2.1). As well as speeding things up, using FITS files will also reduce the need to use -disk or -Xmx flags as above.
Use column-oriented storage
For really large tables storing them in the colfits output format can significantly improve performance. This stores all the elements of a single column contiguously on disk, which means that scanning through all the values in one or a few columns can proceed with much less unnecessary I/O than in normal (row-oriented) FITS format. It will make most difference when the table is larger than the amount of physical memory available, and the table has many columns. Be aware however that operations which require all the cells in all the rows (for instance, calculating row statistics) may be somewhat slower using this format.

It is also possible to use column-oriented storage for non-FITS files by specifying the flag -Dstartable.storage=sideways. This is like using the -disk flag but uses column-oriented rather than row-oriented temporary files. However, using it for such large files means that the conversion is likely to be rather slow, so you may be better off converting the original file to colfits format in a separate step and using that.

As far as performance goes, the memory size of the machine you're using does make a difference. If the size of the dataset you're dealing with (this is the size of the FITS HDU if it's in FITS format but may be greater or less than the file size for other formats) will fit into unused physical memory then general everything will run very quickly because the operating system can cache the data in memory; if it's larger than physical memory then the data has to keep being re-read from disk and most operations will be much slower, though use of column-oriented storage can help a lot in that case.


Next Previous Up Contents
Next: Examples
Up: Invoking TOPCAT
Previous: JDBC Configuration

TOPCAT - Tool for OPerations on Catalogues And Tables
Starlink User Note253
TOPCAT web page: http://www.starlink.ac.uk/topcat/
Author email: m.b.taylor@bristol.ac.uk