OpenMS
MSExperimentArrowExport Class Reference

Export MSExperiment data to Apache Arrow format.

#include <OpenMS/FORMAT/MSExperimentArrowExport.h>

Static Public Member Functions

static std::vector< std::string > getSpectraArrowColumnNames (const MSExperiment &exp, const ArrowSpectraExportConfig &config=ArrowSpectraExportConfig{})
 Get available column names for spectra Arrow export.
 
static std::vector< std::string > getChromatogramArrowColumnNames (const MSExperiment &exp, const ArrowChromatogramExportConfig &config=ArrowChromatogramExportConfig{})
 Get available column names for chromatogram Arrow export.
 
static bool exportSpectraToArrowCDataInterface (const MSExperiment &exp, const ArrowSpectraExportConfig &config, ::ArrowSchema *out_schema, ::ArrowArray *out_array)
 Export spectra to Arrow via the C Data Interface (zero-copy to Python).
 
static bool exportChromatogramsToArrowCDataInterface (const MSExperiment &exp, const ArrowChromatogramExportConfig &config, ::ArrowSchema *out_schema, ::ArrowArray *out_array)
 Export chromatograms to Arrow via the C Data Interface (zero-copy to Python).
 
static bool exportSpectraToParquet (const MSExperiment &exp, const String &filename, const ArrowSpectraExportConfig &config=ArrowSpectraExportConfig{}, const ParquetWriteConfig &parquet_config=ParquetWriteConfig{})
 Export MSExperiment spectra to Parquet file.
 
static bool exportChromatogramsToParquet (const MSExperiment &exp, const String &filename, const ArrowChromatogramExportConfig &config=ArrowChromatogramExportConfig{}, const ParquetWriteConfig &parquet_config=ParquetWriteConfig{})
 Export MSExperiment chromatograms to Parquet file.
 

Detailed Description

Export MSExperiment data to Apache Arrow format.

This class provides static methods to export MSExperiment spectra and chromatograms to Apache Arrow Tables and Parquet files.

Experimental classes:
This API is experimental and may change in future versions. The table schema, column names, and data types are subject to modification based on user feedback and evolving requirements.

Member Function Documentation

◆ exportChromatogramsToArrowCDataInterface()

static bool exportChromatogramsToArrowCDataInterface (
    const MSExperiment & exp,
    const ArrowChromatogramExportConfig & config,
    ::ArrowSchema * out_schema,
    ::ArrowArray * out_array
)

Export chromatograms to Arrow via the C Data Interface (zero-copy to Python).

Parameters
[in]  exp         The MSExperiment to export
[in]  config      Export configuration
[out] out_schema  Pointer to ArrowSchema struct
[out] out_array   Pointer to ArrowArray struct
Returns
true on success, false on error

◆ exportChromatogramsToParquet()

static bool exportChromatogramsToParquet (
    const MSExperiment & exp,
    const String & filename,
    const ArrowChromatogramExportConfig & config = ArrowChromatogramExportConfig{},
    const ParquetWriteConfig & parquet_config = ParquetWriteConfig{}
)

Export MSExperiment chromatograms to Parquet file.

Exports chromatogram data to Apache Parquet format. See exportSpectraToParquet() for details on Parquet benefits and options.

Parameters
[in]  exp             The MSExperiment to export
[in]  filename        Output file path
[in]  config          Arrow export configuration
[in]  parquet_config  Parquet writing options
Returns
true on success, false on error

◆ exportSpectraToArrowCDataInterface()

static bool exportSpectraToArrowCDataInterface (
    const MSExperiment & exp,
    const ArrowSpectraExportConfig & config,
    ::ArrowSchema * out_schema,
    ::ArrowArray * out_array
)

Export spectra to Arrow via the C Data Interface (zero-copy to Python).

Exports the Arrow schema and array to C Data Interface format, which allows zero-copy transfer to PyArrow via pyarrow.Table._import_from_c().

Parameters
[in]  exp         The MSExperiment to export
[in]  config      Export configuration
[out] out_schema  Pointer to ArrowSchema struct (caller must allocate)
[out] out_array   Pointer to ArrowArray struct (caller must allocate)
Returns
true on success, false on error
Note
The caller is responsible for calling the release callbacks on the schema and array when done.
This is primarily intended for Python bindings for zero-copy export.
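
The ownership rule in the note above can be sketched without OpenMS. The two struct layouts below follow the Arrow C Data Interface specification; everything else (`fakeExport`, `releaseIfLive`, `exportAndRelease`, the release counter) is a hypothetical stand-in for illustration and is not part of the OpenMS API. In real code the export function would fill the caller-allocated structs in place of `fakeExport`.

```cpp
#include <cassert>
#include <cstdint>

// Struct layouts as defined by the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  std::int64_t flags;
  std::int64_t n_children;
  ArrowSchema** children;
  ArrowSchema* dictionary;
  void (*release)(ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Counts release calls so the sketch can be checked.
static int g_release_calls = 0;

// A release callback frees producer-owned resources (none here) and then
// marks the struct as released by setting its release pointer to null.
static void releaseSchema(ArrowSchema* s) { ++g_release_calls; s->release = nullptr; }
static void releaseArray(ArrowArray* a) { ++g_release_calls; a->release = nullptr; }

// Hypothetical producer: fills caller-allocated structs with an empty
// struct-typed batch. A real exporter would attach buffers and children.
bool fakeExport(ArrowSchema* out_schema, ArrowArray* out_array) {
  if (out_schema == nullptr || out_array == nullptr) return false;
  static const void* no_buffers[1] = {nullptr};
  *out_schema = ArrowSchema{"+s", "", nullptr, 0, 0, nullptr, nullptr, releaseSchema, nullptr};
  *out_array = ArrowArray{0, 0, 0, 1, 0, no_buffers, nullptr, nullptr, releaseArray, nullptr};
  return true;
}

// Consumer side: call release only if the struct is still live.
template <typename T>
void releaseIfLive(T* obj) {
  if (obj->release != nullptr) obj->release(obj);
}

// One export round trip; returns the number of release callbacks invoked.
int exportAndRelease() {
  g_release_calls = 0;
  ArrowSchema schema{};  // caller allocates; zero-initialized, release == nullptr
  ArrowArray array{};
  if (!fakeExport(&schema, &array)) return -1;
  // ... read the structs or hand them to a consumer here ...
  releaseIfLive(&array);
  releaseIfLive(&schema);
  releaseIfLive(&schema);  // no-op: already marked as released
  return g_release_calls;
}
```

Checking `release` against null before calling it keeps a second release attempt harmless. If the structs are handed to a consumer that takes ownership of them (for example a Python import), the release obligation presumably moves with them, so the caller should not release them a second time.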

◆ exportSpectraToParquet()

static bool exportSpectraToParquet (
    const MSExperiment & exp,
    const String & filename,
    const ArrowSpectraExportConfig & config = ArrowSpectraExportConfig{},
    const ParquetWriteConfig & parquet_config = ParquetWriteConfig{}
)

Export MSExperiment spectra to Parquet file.

Exports spectra data to Apache Parquet format, which provides:

  • Columnar storage optimized for analytical queries
  • Efficient compression (typically 3-5x for MS data with ZSTD)
  • Fast partial reads (only requested columns/row groups are loaded)
  • Wide ecosystem support (DuckDB, Polars, pandas, Spark, R arrow)

Long format schema (one row per peak):

  • mz (float64): Peak m/z value (64-bit for mass accuracy)
  • intensity (float32): Peak intensity (32-bit sufficient for dynamic range)
  • rt (float32): Retention time in seconds
  • ion_mobility (float32, nullable): Ion mobility value if present
  • spectrum_index (uint32): Index of spectrum in MSExperiment
  • ms_level (uint8): MS level (1, 2, ...) - small integer, no encoding benefit
  • native_id (utf8 string): Native spectrum identifier
  • precursor_mz (float64, nullable): Precursor m/z (null for MS1)
  • precursor_charge (int16, nullable): Precursor charge
  • precursor_intensity (float32, nullable): Precursor intensity
  • isolation_lower (float64, nullable): Isolation window lower offset
  • isolation_upper (float64, nullable): Isolation window upper offset

Semi-wide format schema (one row per spectrum):

  • spectrum_index (uint32): Index of spectrum in MSExperiment
  • rt (float32): Retention time in seconds
  • ms_level (uint8): MS level
  • native_id (utf8 string): Native spectrum identifier
  • mz (list<float64>): Array of m/z values
  • intensity (list<float32>): Array of intensity values
  • ion_mobility (list<float32>, nullable): Array of ion mobility values
  • precursor_mz (float64, nullable): Precursor m/z (null for MS1)
  • precursor_charge (int16, nullable): Precursor charge
  • precursor_intensity (float32, nullable): Precursor intensity
  • isolation_lower (float64, nullable): Isolation window lower offset
  • isolation_upper (float64, nullable): Isolation window upper offset
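
The difference between the two layouts can be sketched with plain C++ containers. `ToySpectrum`, `LongRow`, `toLongRows`, and `toySpectra` are hypothetical names used only for this illustration and cover just a subset of the columns; they are not OpenMS or Arrow types. A vector of `ToySpectrum` corresponds to the semi-wide layout (one entry per spectrum with list-valued `mz` and `intensity`), and `toLongRows` flattens it into the long layout (one entry per peak).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one semi-wide row: one spectrum, list columns.
struct ToySpectrum {
  float rt;
  std::uint8_t ms_level;
  std::vector<double> mz;
  std::vector<float> intensity;
};

// Hypothetical stand-in for one long row: one peak, with the per-spectrum
// values (rt, spectrum_index, ms_level) repeated on every peak.
struct LongRow {
  double mz;
  float intensity;
  float rt;
  std::uint32_t spectrum_index;
  std::uint8_t ms_level;
};

// Flatten semi-wide rows into long rows (one row per peak).
std::vector<LongRow> toLongRows(const std::vector<ToySpectrum>& spectra) {
  std::vector<LongRow> rows;
  for (std::size_t i = 0; i < spectra.size(); ++i) {
    const ToySpectrum& s = spectra[i];
    for (std::size_t p = 0; p < s.mz.size(); ++p) {
      rows.push_back({s.mz[p], s.intensity[p], s.rt,
                      static_cast<std::uint32_t>(i), s.ms_level});
    }
  }
  return rows;
}

// Two toy spectra with 3 and 2 peaks.
std::vector<ToySpectrum> toySpectra() {
  return {
      {10.5f, 1, {100.0, 200.0, 300.0}, {1.0f, 2.0f, 3.0f}},
      {11.0f, 2, {150.0, 250.0}, {4.0f, 5.0f}},
  };
}
```

With these toy inputs the semi-wide layout has two rows and the long layout has five. The long layout repeats `rt`, `spectrum_index`, and `ms_level` on every peak, which suits SQL-style filtering on single peaks; the semi-wide layout keeps each spectrum together in one row.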

Performance notes:

  • ZSTD compression gives best size/speed tradeoff for MS data
  • Parquet automatically applies RLE for repetitive values on disk
  • Row group size of 128MB balances parallelism and compression
  • Statistics enable predicate pushdown for m/z and RT range queries
  • For very large files (>100M peaks), consider exporting MS levels separately
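
The predicate pushdown mentioned above can be illustrated in a few lines. This is a minimal sketch of the general idea, not how any particular Parquet reader is implemented; `RowGroupStats`, `groupsToScan`, and `toyStats` are hypothetical names. Each row group carries min/max statistics for a column, and a range query only needs to read the groups whose interval overlaps the requested range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical min/max statistics of one column (e.g. rt) in one row group.
struct RowGroupStats {
  double min_value;
  double max_value;
};

// Return the indices of row groups whose [min_value, max_value] interval
// overlaps the query range [lo, hi]; all other groups can be skipped.
std::vector<std::size_t> groupsToScan(const std::vector<RowGroupStats>& stats,
                                      double lo, double hi) {
  std::vector<std::size_t> hits;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    const bool overlaps = stats[i].max_value >= lo && stats[i].min_value <= hi;
    if (overlaps) hits.push_back(i);
  }
  return hits;
}

// Toy statistics for three row groups with non-overlapping value ranges.
std::vector<RowGroupStats> toyStats() {
  return {{100.0, 400.0}, {400.5, 800.0}, {800.5, 1500.0}};
}
```

For a query range of 500 to 600, only the second toy row group overlaps and needs to be read. Pruning of this kind is only effective when the filtered column is clustered within row groups; a column whose values span the full range in every group cannot be skipped this way.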
Parameters
[in]  exp             The MSExperiment to export
[in]  filename        Output file path (.parquet extension recommended)
[in]  config          Arrow export configuration (filtering, format, columns)
[in]  parquet_config  Parquet writing options (compression, row groups)
Returns
true on success, false on error
Note
Errors are logged via OPENMS_LOG_ERROR before returning false.

Example:

MSExperiment exp;
// ... load data ...

// Export with default settings (ZSTD compression, 128MB row groups)
MSExperimentArrowExport::exportSpectraToParquet(exp, "spectra.parquet");

// Export MS2 only with a high compression level
ArrowSpectraExportConfig config;
config.ms_levels = {2};
ParquetWriteConfig pq_config;
pq_config.compression_level = 9;
MSExperimentArrowExport::exportSpectraToParquet(exp, "ms2_spectra.parquet", config, pq_config);

◆ getChromatogramArrowColumnNames()

static std::vector< std::string > getChromatogramArrowColumnNames (
    const MSExperiment & exp,
    const ArrowChromatogramExportConfig & config = ArrowChromatogramExportConfig{}
)

Get available column names for chromatogram Arrow export.

Parameters
[in]  exp     The MSExperiment to analyze
[in]  config  Export configuration
Returns
Vector of available column names

◆ getSpectraArrowColumnNames()

static std::vector< std::string > getSpectraArrowColumnNames (
    const MSExperiment & exp,
    const ArrowSpectraExportConfig & config = ArrowSpectraExportConfig{}
)

Get available column names for spectra Arrow export.

Returns the list of column names that would be included in the export based on the configuration and the actual data in the experiment.

Parameters
[in]  exp     The MSExperiment to analyze
[in]  config  Export configuration
Returns
Vector of available column names