Parquet in the VO

Version 1.0

IVOA Note 2025-01-14

Working Group

Applications

This Version

https://www.ivoa.net/documents/Notes/voparquet/20250114

Latest Version

https://www.ivoa.net/documents/Notes/voparquet

Previous Versions

This is the first public release

Author(s)

Editor(s)

Mark Taylor

Version Control

Revision 7916569, last change 2025-01-14 14:44:31 +0000

Abstract

Parquet is a file format for record-based data, with widespread industry tool support. It is being adopted by several astronomy projects for bulk storage and distribution of large tabular data products. This Note discusses best practice for use of parquet within the VO, and in particular defines the VOParquet convention which uses VOTable to attach rich astronomical metadata to otherwise metadata-poor parquet files.

Status of this Document

This is an IVOA Note expressing suggestions from and opinions of the authors. It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.

A list of current IVOA Recommendations and other technical documents can be found in the IVOA document repository.

1 Introduction
    1.1 Scope
2 VOParquet Convention
    2.1 Serialization Approach
    2.2 Serialization Format
    2.3 Data/Metadata Mismatches
3 Alternative Approaches
A Changes from Previous Versions

Conformance-related definitions

The words "MUST", "SHALL", "SHOULD", "MAY", "RECOMMENDED", and "OPTIONAL" (in upper or lower case) used in this document are to be interpreted as described in IETF standard RFC2119 (Bradner, 1997).

The Virtual Observatory (VO) is a general term for a collection of federated resources that can be used to conduct astronomical research, education, and outreach. The International Virtual Observatory Alliance (IVOA) is a global collaboration of separately funded projects to develop standards and infrastructure that enable VO applications.

1 Introduction

The Apache Parquet file format¹ is a column-oriented storage format for record-based data first developed in 2013. It offers per-column compression, dictionary encoding, and a kind of column value indexing. A number of data processing environments optimised for parallel computing make use of these features to enable fast processing of large or very large tables. At the time of writing (early 2025), several astronomy projects including Rubin, Gaia and SPHEREx are using or planning to use parquet for storage, processing and distribution of large-scale astronomical data products. I/O libraries are available in various languages including Python, Java, C++ and Rust, and these have been leveraged by astronomer-facing software such as Astropy, CDS services and TOPCAT to facilitate use of parquet data in astronomy.

Although Parquet offers efficient data storage, the standard semantic metadata it provides is quite rudimentary. Apart from a name and datatype for each column, there is only a list of untyped key-value pairs per table and per column-chunk, with no standard semantics for the keys. For scientific usability, better semantic metadata is desirable and even necessary, especially in view of the complexity of the data represented; astronomy tables can easily contain hundreds of columns. A minimum requirement is column attributes such as units, descriptions, UCDs and precisions; in many cases additional information relating to coordinate systems, service descriptors or processing flags may also be required.

The VOTable format (Ochsenbein and Taylor et al., 2019) has been developed within the VO since its inception to hold exactly the kind of metadata required here. Combining the virtues of VOTable and Parquet can therefore supply a format that delivers storage efficiency alongside rich astronomical metadata.

This Note addresses the question of how to effect that combination. In particular it defines a convention named VOParquet that stores VOTable metadata in parquet files in a way which will be interoperable between data producers and consumers from different projects. Although usage may be refined in future as the result of developing requirements and implementation experience, the intention (at least, the hope) is that the prescriptions here will remain valid as a backwardly compatible baseline for some while, so that future iterations of parquet I/O software in the VO will remain compatible with files written according to this Note. For this reason, we restrict ourselves here to the minimum rules that will enable interoperability, and avoid imposing requirements on points of detail for which the best way forward is not obvious.

The normative part of this VOParquet convention is quite short and can be found in Section 2 and especially Subsection 2.2. The other sections provide context and discussion.

1.1 Scope

The topic of this Note suggests other discussions, including best practice for sharding large datasets among multiple parquet files, policy for choice of compression algorithms within parquet, suitability of parquet for archival storage, and the application of similar ideas to enhance other metadata-poor file formats using VOTable.

This Note avoids those questions in the interest of achieving rapid consensus on the question of combining VOTable and parquet. Several projects will be generating large parquet collections in the near future, so early agreement on the basics of the format is required to achieve interoperability among datasets that will be too large to rewrite at a later date.

Future work may build on the current document and on implementation experience to produce a revised Note or a Recommendation-track document that extends the current proposal or addresses some of these wider questions.

2 VOParquet Convention

2.1 Serialization Approach

The parquet and VOTable file formats both provide serialization of tabular data, along with some degree of file- and column-level metadata. Given an abstract input table with rich metadata, the basic prescription for writing a VOParquet file is:

serialize the input table to VOTable but without including the data part, thus producing an XML document containing table metadata only
serialize the input table to parquet in the usual way, but include the data-less VOTable document in the file-level metadata of the parquet file

The parquet file generated in the final step is the VOParquet output. When reading such a file:

read the parquet data in the usual way
search for a VOTable in the file-level metadata
if one is present, parse it and use the table- and column-level metadata it contains to decorate the data read from parquet

The serialized table is therefore a perfectly legal parquet file, which can be read by any parquet I/O software. VOParquet-aware software can additionally use the attached data-less VOTable to recover the rich metadata associated with the original table.
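As an illustration of the first writing step, the following is a minimal sketch of producing a metadata-only VOTable document with the Python standard library. The function name `build_dataless_votable` and the list-of-dicts column description are hypothetical, introduced here only for illustration; the VOTable version and namespace are copied from the example in Section 2.2 and are not mandated by the convention. The sketch does not perform schema validation.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example document in Section 2.2 (an assumption
# for this sketch; the convention does not mandate a VOTable version).
VOTABLE_NS = "http://www.ivoa.net/xml/VOTable/v1.3"


def build_dataless_votable(table_name, columns):
    """Return a metadata-only VOTable document as a string.

    `columns` is a hypothetical structure: a list of dicts whose keys are
    FIELD attribute names (name, datatype, unit, ucd, ...). An optional
    "description" key becomes a DESCRIPTION child element.
    """
    ET.register_namespace("", VOTABLE_NS)  # serialize with a default xmlns

    def q(tag):
        return f"{{{VOTABLE_NS}}}{tag}"

    root = ET.Element(q("VOTABLE"), {"version": "1.4"})
    resource = ET.SubElement(root, q("RESOURCE"))
    table = ET.SubElement(resource, q("TABLE"), {"name": table_name})
    for col in columns:
        attrs = {k: v for k, v in col.items() if k != "description"}
        field = ET.SubElement(table, q("FIELD"), attrs)
        if "description" in col:
            ET.SubElement(field, q("DESCRIPTION")).text = col["description"]
    # Deliberately no DATA element: the table data live in the parquet file.
    return ET.tostring(root, encoding="unicode")


columns = [
    {"name": "ID", "datatype": "long", "description": "Source identifier"},
    {"name": "RA", "datatype": "double", "unit": "deg", "ucd": "pos.eq.ra"},
]
votable_xml = build_dataless_votable("MessierObjects", columns)
```

The resulting string is what a writer would then place in the file-level metadata of the parquet file, using whatever mechanism its parquet I/O library provides.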

2.2 Serialization Format

The VOTable metadata document stored in the parquet metadata must contain a TABLE element describing the parquet data table. This looks exactly like a normal TABLE element except that it has no DATA child; such dataless tables are permitted by the VOTable schema. In particular it must contain FIELD elements describing the columns of the parquet data, and it may contain other elements such as PARAM and COOSYS that provide additional table-level metadata.

This DATA-less TABLE must be the first TABLE element in the VOTable document. Other TABLEs, for instance providing auxiliary data or metadata, may appear in the VOTable document, but are not used to describe the parquet data directly. The VOTable document must be a schema-valid and legal VOTable instance. No particular VOTable version is mandated by this convention.

An example VOTable metadata document describing a 3-column table might look like this²:

<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE>
    <TABLE name="MessierObjects">
      <DESCRIPTION>Nebulae and clusters</DESCRIPTION>
      <PARAM name="author" datatype="char" arraysize="*"
             value="Charles Messier"/>
      <FIELD datatype="long" name="ID">
        <DESCRIPTION>Source identifier</DESCRIPTION>
      </FIELD>
      <FIELD datatype="double" name="RA" ucd="pos.eq.ra" unit="deg">
        <DESCRIPTION>ICRS Right Ascension</DESCRIPTION>
      </FIELD>
      <FIELD datatype="double" name="DEC" ucd="pos.eq.dec" unit="deg">
        <DESCRIPTION>ICRS Declination</DESCRIPTION>
      </FIELD>
      <!-- Metadata-only TABLE - no DATA element -->
    </TABLE>
  </RESOURCE>
</VOTABLE>
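A reader might check the structural rules stated above (the first TABLE in document order describes the parquet data and has no DATA child) along the following lines. This is a sketch only: the function names are hypothetical, element names are matched without regard to namespace so that any VOTable version is accepted, and no schema validation is attempted.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def describe_first_table(votable_xml):
    """Return (name, datatype, unit) for each FIELD of the first TABLE.

    Raises ValueError if there is no TABLE element, or if the first TABLE
    has a DATA child and so is not a metadata-only table.
    """
    root = ET.fromstring(votable_xml)
    # Element.iter() walks the tree in document order.
    tables = [el for el in root.iter() if local_name(el.tag) == "TABLE"]
    if not tables:
        raise ValueError("no TABLE element found")
    first = tables[0]
    if any(local_name(child.tag) == "DATA" for child in first):
        raise ValueError("first TABLE has a DATA child")
    return [
        (child.get("name"), child.get("datatype"), child.get("unit"))
        for child in first
        if local_name(child.tag) == "FIELD"
    ]


# Abbreviated form of the example document above.
SAMPLE = """
<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE>
    <TABLE name="MessierObjects">
      <FIELD datatype="long" name="ID"/>
      <FIELD datatype="double" name="RA" ucd="pos.eq.ra" unit="deg"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""
fields = describe_first_table(SAMPLE)
```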

This dataless VOTable document is stored in the key_value_metadata list of the FileMetaData structure in the parquet footer. That list is defined by the parquet file format³ to contain an unstructured collection of string-string key-value pairs, and is available for applications to populate with arbitrary metadata. The VOParquet convention requires the following key-value pairs to be present:

IVOA.VOTable-Parquet.version: The version of this convention. Must be "1.0" at this version.
IVOA.VOTable-Parquet.content: The content of the data-less VOTable document described above, encoded using UTF-8. An XML declaration ("<?xml ... ?>") may optionally precede the content, but if present it must not declare a non-UTF-8 encoding.
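The two required entries can be pictured as an ordinary string-to-string mapping. The sketch below builds and reads such a mapping using only the Python standard library; the function names are hypothetical, and the step of attaching the mapping to a real parquet footer is left to the parquet I/O library in use. Because some libraries are assumed to return keys and values as bytes rather than strings, the reader side accepts both.

```python
import re

VERSION_KEY = "IVOA.VOTable-Parquet.version"
CONTENT_KEY = "IVOA.VOTable-Parquet.content"

# Matches an optional leading XML declaration that names an encoding.
_DECL = re.compile(r"^\s*<\?xml[^>]*?\bencoding\s*=\s*[\"']([^\"']+)[\"']")


def make_voparquet_metadata(votable_xml):
    """Return the key-value pairs required by version 1.0 of the convention."""
    match = _DECL.match(votable_xml)
    if match and match.group(1).lower() != "utf-8":
        raise ValueError("XML declaration must not declare a non-UTF-8 encoding")
    return {VERSION_KEY: "1.0", CONTENT_KEY: votable_xml}


def _text(value):
    """Decode bytes as UTF-8; pass strings through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def extract_votable(key_value_metadata):
    """Return (version, votable_xml), or None for a non-VOParquet file."""
    found = {}
    for key, value in key_value_metadata.items():
        key = _text(key)
        if key in (VERSION_KEY, CONTENT_KEY):
            found[key] = _text(value)
    if VERSION_KEY not in found or CONTENT_KEY not in found:
        return None
    return found[VERSION_KEY], found[CONTENT_KEY]


doc = '<?xml version="1.0" encoding="UTF-8"?><VOTABLE version="1.4"/>'
metadata = make_voparquet_metadata(doc)
result = extract_votable(metadata)
```

Returning None when either key is absent lets a reader fall back to treating the file as plain parquet, consistent with the reading procedure in Section 2.1.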