
Data storage formats: Avro, Protobuf, Parquet, ORC

4.3.2017 16:33
Comparison of data formats for storing and transmitting data.

Usage domains:

  • APIs: Protobuf, Thrift, Avro
  • storage for data analysis (Hive, Impala, ...): ORC, Parquet, Avro
  • data storage: Sequence files, compressed text (gzip, bzip2, lz4), Avro

Protobuf

  • Google's Protocol Buffers
  • defines the serialisation of a single record (message) only, not a container file format
  • suitable for data transport
  • good for optional attributes: a message contains data only for the attributes that are present
  • attributes are identified by a numeric ID (field number), not by name
  • supports schema evolution
  • similar: Cap'n Proto
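
The two points about optional attributes and numeric IDs can be sketched in a few lines of Python. This is only a hand-rolled illustration of the varint part of the Protobuf wire format (key = field number shifted left by 3, combined with the wire type), not the real library; the function names are made up for this example and only integer fields are handled.

```python
from __future__ import annotations

WIRE_VARINT = 0  # the only wire type handled in this sketch


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields: dict[int, int | None]) -> bytes:
    """Encode {field number: value}; absent (None) fields write nothing."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        if value is None:
            continue  # optional attribute not present: zero bytes on the wire
        out += encode_varint((number << 3) | WIRE_VARINT)
        out += encode_varint(value)
    return bytes(out)


def decode_message(buf: bytes) -> dict[int, int]:
    """Decode a message produced by encode_message back into a dict."""
    fields: dict[int, int] = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type != WIRE_VARINT:
            raise ValueError(f"unsupported wire type {wire_type}")
        value, pos = decode_varint(buf, pos)
        fields[number] = value
    return fields


# Field 1 set to 150, field 2 absent, field 3 set to 1.
wire = encode_message({1: 150, 2: None, 3: 1})
print(wire.hex())            # → 0896011801
print(decode_message(wire))  # → {1: 150, 3: 1}
```

Because only the field number travels on the wire, an attribute can be renamed in the schema without breaking old messages, as long as its number stays the same.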

Thrift

  • developed at Facebook, newer than Protobuf
  • slightly slower, with slightly larger output, than Protobuf
  • supports more complex data types than Protobuf
  • includes an RPC implementation
  • supports schema evolution

Avro

  • row-based
  • defines both record and container (file) serialisation
  • supports schema evolution
  • schemas are written in JSON
  • splittable in Hadoop
  • container corruption: sync markers separate the data blocks, so after a corruption only the records up to the end of the affected block are lost
  • good for complex tables with strings
  • schema is stored in the file header, so no external schema is needed
  • rows can be appended
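
How sync markers limit the damage can be shown with a toy container. The sketch below is not the real Avro file format (Avro uses a binary record encoding and its own header layout); it only borrows the idea of a schema in the header and a 16-byte sync marker after every block. The layout, the JSON-lines payload and the function names are assumptions made for this example, and blocks are assumed to be non-empty.

```python
import json
import os

SYNC_SIZE = 16


def write_container(schema, blocks, sync=None):
    """Toy layout: header length, JSON schema, sync, then for each block:
    record count, payload size, payload (JSON lines), sync."""
    sync = sync if sync is not None else os.urandom(SYNC_SIZE)
    header = json.dumps(schema).encode()
    out = bytearray(len(header).to_bytes(4, "big") + header + sync)
    for records in blocks:
        payload = "\n".join(json.dumps(r) for r in records).encode()
        out += len(records).to_bytes(4, "big")
        out += len(payload).to_bytes(4, "big")
        out += payload + sync
    return bytes(out)


def read_container(buf):
    """Return (schema, records, lost_blocks), skipping damaged blocks."""
    header_len = int.from_bytes(buf[:4], "big")
    schema = json.loads(buf[4:4 + header_len])
    pos = 4 + header_len
    sync = buf[pos:pos + SYNC_SIZE]
    pos += SYNC_SIZE

    records, lost_blocks = [], 0
    while pos < len(buf):
        start = pos
        try:
            count = int.from_bytes(buf[pos:pos + 4], "big")
            size = int.from_bytes(buf[pos + 4:pos + 8], "big")
            end = pos + 8 + size
            if buf[end:end + SYNC_SIZE] != sync:
                raise ValueError("sync marker not where expected")
            lines = buf[pos + 8:end].decode().split("\n")
            block = [json.loads(line) for line in lines]
            if len(block) != count:
                raise ValueError("record count mismatch")
        except ValueError:
            # Damaged block: give up on it and resynchronise at the
            # next sync marker, so later blocks stay readable.
            lost_blocks += 1
            nxt = buf.find(sync, start)
            if nxt == -1:
                break
            pos = nxt + SYNC_SIZE
            continue
        records.extend(block)
        pos = end + SYNC_SIZE
    return schema, records, lost_blocks
```

If one byte inside the second of three blocks is overwritten, the reader in this sketch still returns the records of the first and third block and reports one lost block. The same markers are what let a reader start in the middle of a file, which is the basis for splitting in Hadoop.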

Parquet + ORC

  • column-based
  • great when reading only a subset of attributes
  • schema in the footer
  • splittable in Hadoop
  • store column statistics (min, max, count); ORC also has indexes
  • support hierarchical data structures
  • write-once formats
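
Why column layout and min/max statistics help can be sketched with plain Python lists. This is an in-memory model only, not the Parquet or ORC file layout; `ColumnChunk`, `make_row_group` and `scan` are hypothetical names for this example. Each row group keeps every column separately together with its statistics, and a range filter skips whole groups whose statistics rule out a match.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    values: list      # all values of one column within one row group
    minimum: object   # column statistics kept next to the data
    maximum: object
    count: int


def make_row_group(rows):
    """Turn a list of row dicts into {column name: ColumnChunk}."""
    group = {}
    for name in rows[0]:
        vals = [row[name] for row in rows]
        group[name] = ColumnChunk(vals, min(vals), max(vals), len(vals))
    return group


def scan(row_groups, select, where_col, lo, hi):
    """Return (matching tuples of the selected columns, groups actually read)
    for rows where lo <= where_col <= hi."""
    result, groups_read = [], 0
    for group in row_groups:
        stats = group[where_col]
        if stats.maximum < lo or stats.minimum > hi:
            continue  # statistics prove no row in this group can match
        groups_read += 1
        for i, value in enumerate(stats.values):
            if lo <= value <= hi:
                # Only the selected columns are touched, never whole rows.
                result.append(tuple(group[col].values[i] for col in select))
    return result, groups_read


groups = [
    make_row_group([{"year": 2015, "price": 10, "note": "a"},
                    {"year": 2015, "price": 12, "note": "b"}]),
    make_row_group([{"year": 2016, "price": 20, "note": "c"},
                    {"year": 2016, "price": 22, "note": "d"}]),
    make_row_group([{"year": 2017, "price": 30, "note": "e"},
                    {"year": 2017, "price": 32, "note": "f"}]),
]
rows, groups_read = scan(groups, ["price"], "year", 2016, 2016)
print(rows, groups_read)  # → [(20,), (22,)] 1
```

In this model the "note" column is never read and two of the three row groups are skipped by statistics alone, which is the effect the list above attributes to both formats.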

Protobuf, Thrift and Avro comparison

Created on 4 March 2017 at 17:14:45 by mira. Edited 3 times, most recently on 4 March 2017 at 21:16:06 by mira.

