Understanding Data Compression
Transformation Hub compression settings affect data in two general areas: communication and storage. Specifically, this refers to data at rest (stored on disk in Kafka topic partitions) and data in transit.
- All producers compress data before sending it; this includes external producers, such as connectors and collectors, and internal producers, such as the routing and CEF2Avro processors.
- For data in transit, data compression is controlled by the producer's configuration.
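As an illustration, a standalone Kafka producer selects its compression codec with the standard `compression.type` producer property. The broker address below is a placeholder, not a Transformation Hub endpoint; Transformation Hub's own producers expose the equivalent choice through their configuration pages rather than a properties file.

```python
# Illustrative Kafka producer settings; the broker address is a placeholder.
# "compression.type" is the standard Kafka producer property. Valid producer
# values include "none", "gzip", and "zstd"; Transformation Hub uses gzip
# and zstd.
producer_config = {
    "bootstrap.servers": "th-kafka:9092",  # placeholder address
    "compression.type": "zstd",            # or "gzip"; "none" disables it
}

assert producer_config["compression.type"] in {"none", "gzip", "zstd"}
```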
Data Consumers
There is no property that controls data compression on consumers. Consumers read metadata from each message, which indicates the correct decompression algorithm to use. Since this is evaluated on a message-by-message basis, the consumer's behavior does not depend on which topic it is consuming from. A single topic might contain messages which have been compressed with different compression algorithms (also referred to as compression types or codecs).
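This behavior follows from Kafka's wire format: each record batch records its own compression codec in the low three bits of the batch attributes field, so a consumer decompresses based on that metadata rather than on any consumer-level setting. A minimal sketch, using the codec IDs defined by the Kafka protocol:

```python
# Kafka record batches store the compression codec in bits 0-2 of the
# "attributes" field, so consumers decompress each batch from its own
# metadata rather than from a topic- or consumer-level setting.
CODEC_BY_ID = {0: "uncompressed", 1: "gzip", 2: "snappy", 3: "lz4", 4: "zstd"}

def codec_of(attributes: int) -> str:
    """Return the compression codec recorded in a batch's attributes field."""
    return CODEC_BY_ID[attributes & 0b111]

# Batches read from the same topic may therefore use different codecs:
assert codec_of(1) == "gzip"
assert codec_of(4) == "zstd"
```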
Data Storage (Data at Rest)
The algorithm used to compress stored data is determined by the topic configuration. All Transformation Hub topics, except th-arcsight-avro and mf-event-avro-enriched, currently use the default compression type, producer, which means the topic retains whatever compression algorithm the producer used. Leaving this producer-defined gives the producer the flexibility to send either compressed (using any supported codec) or uncompressed data.
The mf-event-avro-enriched topic is an exception because the database scheduler reads from this topic but does not yet support reading messages encoded with the ZStandard (zstd) compression algorithm. Therefore, this topic has a specific out-of-the-box value to ensure that the database scheduler can read it, regardless of the over-the-wire compression used.
| Topic | Compression Type | Transformation Hub Version Support |
|---|---|---|
| All topics except th-arcsight-avro | producer (default) | 3.4.0 and earlier (3.5.0 and earlier for mf-event-avro-enriched) |
| th-arcsight-avro | gzip | 3.4.0 and 3.3.0 |
| th-arcsight-avro | uncompressed | 3.2.0 and earlier |
| mf-event-avro-enriched | gzip | 3.5.0 |
Configuring Compression
There are two places in the Kafka architecture where compression can be configured: the producer and the topic.
- Producer-level compression is set on the producer; for example, in SmartConnector Transformation Hub destination parameters. For producers that reside inside Transformation Hub, such as routing and stream processors, the compression algorithm is configured on the Transformation Hub configuration page, during deployment.
- Topic-level compression can be set with Kafka Manager (using the Topic > Update Config menu); however, it is strongly recommended that these settings be left at their default values.
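The topic-level defaults described above can be pictured as the following `compression.type` values per topic. This is a sketch of the documented defaults, not something to set by hand; the `"<any-other-topic>"` key is an illustrative placeholder.

```python
# Default topic-level compression.type values, as described above.
# "producer" tells Kafka to retain whatever codec the producer used.
topic_compression = {
    "mf-event-avro-enriched": "gzip",  # kept readable by the database scheduler
    "th-arcsight-avro": "gzip",
    "<any-other-topic>": "producer",   # placeholder: all remaining topics
}

assert topic_compression["mf-event-avro-enriched"] == "gzip"
```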
Compression Types
While Kafka supports a handful of compression types, Transformation Hub implements only two types: gzip and zstd.
- gzip: By default, gzip is used for Transformation Hub routing and stream processors, as well as for SmartConnectors. This is for backward compatibility and might change in a future release.
- zstd: Testing has shown that zstd uses less bandwidth, storage, and CPU resources than gzip. For bandwidth-constrained networks, higher EPS is typically seen when using zstd; however, actual results are unique to each environment. Third-party Java producers should use kafka-clients version 2.1.0 or later for zstd support. ArcSight consumers compatible with zstd include Logger 7.0, ESM 7.2, and IDI 1.1, or later.
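As a rough illustration of why compression matters on constrained links, the Python standard-library `gzip` module can show the size reduction on a repetitive, text-like payload. The payload here is a stand-in, and real ratios for event data will differ by environment and codec.

```python
import gzip
import json

# A repetitive JSON payload stands in for event traffic; real CEF/Avro
# events, and real compression ratios, will differ by environment.
event = json.dumps({"events": [{"name": "sample", "severity": 5}] * 50}).encode()
packed = gzip.compress(event)

assert len(packed) < len(event)  # the compressed payload is smaller
ratio = len(event) / len(packed)
```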