Configurations for Auron
Runtime Configuration
| Conf Key | Type | Default Value | Description |
|---|---|---|---|
| spark.auron.enableInputBatchStatistics | Boolean | true | Enable collection of additional metrics for input batch statistics to monitor data processing performance. |
| spark.auron.enabled alternative: spark.auron.enable | Boolean | true | Enable Spark Auron support to accelerate query execution with native implementations. |
| spark.auron.onHeapSpill.memoryFraction | Double | 0.9 | Maximum memory fraction allocated for on-heap spilling operations. This controls what portion of the available on-heap memory can be used for spilling data to disk when memory pressure occurs. |
| spark.auron.process.vmrss.memoryFraction | Double | 0.9 | Suggested fraction of process total memory (on-heap and off-heap) to use for resident memory. This controls the memory limit for the process's virtual memory resident set size (VMRSS). |
| spark.auron.shuffle.compression.targetBufSize | Integer | 4194304 | Target buffer size in bytes for shuffle compression operations. This setting controls the buffer size used during shuffle data compression, affecting both compression efficiency and memory usage. Default is 4MB (4,194,304 bytes). |
| spark.auron.spill.compression.codec | String | lz4 | Compression codec used for Spark spill operations when data is written to disk due to memory pressure. Common options include lz4, snappy, and gzip. The choice affects both spill performance and disk space usage. |
| spark.auron.suggested.batch.memSize | Integer | 8388608 | Suggested memory size in bytes for record batches. This setting controls the target memory allocation for individual data batches to optimize memory usage and processing efficiency. Default is 8MB (8,388,608 bytes). |
| spark.auron.suggested.batch.memSize.multiwayMerging | Integer | 1048576 | Suggested memory size in bytes for k-way merging operations. This uses a smaller batch memory size compared to regular operations since multiple batches are kept in memory simultaneously during k-way merging. Default is 1MB (1,048,576 bytes). |
| spark.auron.tokio.worker.threads.per.cpu | Integer | 0 | Number of Tokio worker threads to create per CPU core (spark.task.cpus). Set to 0 for automatic detection based on available CPU cores. This setting controls the thread pool size for Tokio-based asynchronous operations. |
| spark.auron.ui.enabled alternative: spark.auron.ui.enable | Boolean | true | Enable Spark Auron UI support to display Auron-specific metrics and statistics in Spark UI. |
| spark.io.compression.codec | String | - | Compression codec used for Spark I/O operations. Common options include lz4, snappy, gzip, and zstd. The choice of codec affects both compression ratio and decompression speed. |
| spark.io.compression.zstd.level | Integer | - | Compression level for Zstandard (zstd) compression codec used in Spark I/O operations. Valid values range from 1 (fastest) to 22 (highest compression). Higher levels provide better compression but require more CPU time and memory. |
| spark.task.cpus | Integer | - | Number of CPU cores allocated per Spark task. This setting determines the parallelism level for individual tasks and affects resource allocation and task scheduling. Defaults to spark.task.cpus. |
Operator Supports
| Conf Key | Type | Default Value | Description |
|---|---|---|---|
| spark.auron.enable.aggr | Boolean | true | Enable AggregateExec operation conversion to native Auron implementations. |
| spark.auron.enable.bhj | Boolean | true | Enable BroadcastHashJoinExec operation conversion to native Auron implementations. |
| spark.auron.enable.bnlj | Boolean | true | Enable BroadcastNestedLoopJoinExec operation conversion to native Auron implementations. |
| spark.auron.enable.broadcastExchange | Boolean | true | Enable BroadcastExchangeExec operation conversion to native Auron implementations. |
| spark.auron.enable.collectLimit | Boolean | true | Enable CollectLimitExec operation conversion to native Auron implementations. |
| spark.auron.enable.data.writing | Boolean | false | Enable DataWritingExec operation conversion to native Auron implementations. |
| spark.auron.enable.expand | Boolean | true | Enable ExpandExec operation conversion to native Auron implementations. |
| spark.auron.enable.filter | Boolean | true | Enable FilterExec operation conversion to native Auron implementations. |
| spark.auron.enable.generate | Boolean | true | Enable GenerateExec operation conversion to native Auron implementations. |
| spark.auron.enable.global.limit | Boolean | true | Enable GlobalLimitExec operation conversion to native Auron implementations. |
| spark.auron.enable.local.limit | Boolean | true | Enable LocalLimitExec operation conversion to native Auron implementations. |
| spark.auron.enable.local.table.scan | Boolean | true | Enable LocalTableScanExec operation conversion to native Auron implementations. |
| spark.auron.enable.paimon.scan | Boolean | true | Enable PaimonScanExec operation conversion to native Auron implementations. |
| spark.auron.enable.project | Boolean | true | Enable ProjectExec operation conversion to native Auron implementations. |
| spark.auron.enable.scan | Boolean | true | Enable ScanExec operation conversion to native Auron implementations. |
| spark.auron.enable.shj | Boolean | true | Enable ShuffledHashJoinExec operation conversion to native Auron implementations. |
| spark.auron.enable.shuffleExchange | Boolean | true | Enable ShuffleExchangeExec operation conversion to native Auron implementations. |
| spark.auron.enable.smj | Boolean | true | Enable SortMergeJoinExec operation conversion to native Auron implementations. |
| spark.auron.enable.sort | Boolean | true | Enable SortExec operation conversion to native Auron implementations. |
| spark.auron.enable.take.ordered.and.project | Boolean | true | Enable TakeOrderedAndProjectExec operation conversion to native Auron implementations. |
| spark.auron.enable.union | Boolean | true | Enable UnionExec operation conversion to native Auron implementations. |
| spark.auron.enable.window | Boolean | true | Enable WindowExec operation conversion to native Auron implementations. |
| spark.auron.enable.window.group.limit | Boolean | true | Enable WindowGroupLimitExec operation conversion to native Auron implementations. |
| spark.auron.forceShuffledHashJoin | Boolean | false | Force replacement of all sort-merge joins with shuffled-hash joins for performance comparison and benchmarking. This setting is primarily used for testing and performance analysis, as different join strategies may be optimal for different data distributions and query patterns. |
| spark.auron.smjfallback.enable | Boolean | false | Enable fallback from hash join to sort-merge join when the hash table becomes too large to fit in memory. This prevents out-of-memory errors by switching to a more memory-efficient join strategy when necessary. |
| spark.auron.smjfallback.mem.threshold | Integer | 134217728 | Memory size threshold in bytes that triggers fallback from hash join to sort-merge join. When the hash table memory usage exceeds this threshold (128MB by default), the system switches to sort-merge join to prevent memory overflow. |
| spark.auron.smjfallback.rows.threshold | Integer | 10000000 | Row count threshold that triggers fallback from hash join to sort-merge join. When the number of rows in the hash table exceeds this threshold, the system will switch to sort-merge join to avoid memory issues. |
Data Sources
| Conf Key | Type | Default Value | Description |
|---|---|---|---|
| spark.auron.enable.scan.orc | Boolean | true | Enable OrcScanExec operation conversion to native Auron implementations. |
| spark.auron.enable.scan.orc.timestamp | Boolean | true | Enable OrcScanExec operation conversion with timestamp fields to native Auron implementations. |
| spark.auron.enable.scan.parquet | Boolean | true | Enable ParquetScanExec operation conversion to native Auron implementations. |
| spark.auron.enable.scan.parquet.timestamp | Boolean | true | Enable ParquetScanExec operation conversion with timestamp fields to native Auron implementations. |
| spark.auron.files.ignoreCorruptFiles | Boolean | - | Ignore corrupted input files, defaults to spark.sql.files.ignoreCorruptFiles |
| spark.auron.orc.force.positional.evolution | Boolean | false | Force ORC positional evolution mode for schema evolution operations. When enabled, column mapping will be based on column position rather than column name, which can be useful for certain schema evolution scenarios. |
| spark.auron.orc.schema.caseSensitive.enable | Boolean | false | Enable case-sensitive schema matching for ORC files. When true, column names in the schema must match the case of columns in the ORC file exactly. When false, column name matching is case-insensitive. |
| spark.auron.orc.timestamp.use.microsecond | Boolean | false | Use microsecond precision when reading ORC timestamp columns instead of the default millisecond precision. This provides higher temporal resolution for timestamp data but may require more storage space. |
| spark.auron.parquet.enable.bloomFilter | Boolean | false | Enable Parquet bloom filter support for efficient equality predicate filtering. Bloom filters can quickly determine if a value might exist in a data block, reducing unnecessary I/O operations. |
| spark.auron.parquet.enable.pageFiltering | Boolean | false | Enable Parquet page-level filtering to skip reading unnecessary data pages during query execution. This optimization can significantly improve read performance by avoiding I/O for pages that don't match filter predicates. |
| spark.auron.parquet.maxOverReadSize | Integer | 16384 | Maximum over-read size in bytes for Parquet file operations. This controls how much extra data can be read beyond the required data to optimize I/O operations and improve read performance. |
| spark.auron.parquet.metadataCacheSize | Integer | 5 | Size of the Parquet metadata cache in number of entries. This cache stores file metadata to avoid repeated metadata reads and improve query performance for frequently accessed files. |
Expression/Function Supports
| Conf Key | Type | Default Value | Description |
|---|---|---|---|
| spark.auron.cast.trimString | Boolean | true | Enable automatic trimming of whitespace from string inputs before casting to numeric or boolean types. This helps prevent casting errors due to leading/trailing whitespace. |
| spark.auron.datetime.extract.enabled | Boolean | false | Enable datetime extract operations conversion to native Auron implementations. |
| spark.auron.decimal.arithOp.enabled | Boolean | false | Enable decimal arithmetic operations conversion to native Auron implementations. |
| spark.auron.enable.caseconvert.functions | Boolean | true | Enable converting UPPER/LOWER string functions to native implementations for better performance. Note: May produce different outputs from Spark in special cases due to different Unicode versions. |
| spark.auron.forceShortCircuitAndOr | Boolean | false | Force the use of short-circuit evaluation (PhysicalSCAndExprNode/PhysicalSCOrExprNode) for AND/OR expressions, regardless of whether the right-hand side contains Hive UDFs. This can improve performance by avoiding unnecessary evaluation of expressions when the result is already determined. |
| spark.auron.parseJsonError.fallback | Boolean | true | Enable fallback to UDFJson implementation when native JSON parsing encounters errors. This ensures query execution continues even when the native JSON parser fails, at the cost of potentially slower performance. |
| spark.auron.udf.UDFJson.enabled | Boolean | true | Enable UDFJson function conversion to native Auron implementations. |
| spark.auron.udf.brickhouse.enabled | Boolean | true | Enable Brickhouse UDF conversion to native Auron implementations. |
| spark.auron.udf.singleChildFallback.enabled | Boolean | true | Enable falling-back UDF/expression with single child. |
UDAF Fallback
| Conf Key | Type | Default Value | Description |
|---|---|---|---|
| spark.auron.suggested.udaf.memUsedSize | Integer | 64 | Suggested memory usage size per row for TypedImperativeAggregate functions in bytes. This helps in memory allocation planning for UDAF operations. |
| spark.auron.udafFallback.enable | Boolean | true | Enable fallback support for UDAF and other aggregate functions that are not implemented in Auron, allowing them to be executed using Spark's native implementation. |
| spark.auron.udafFallback.num.udafs.trigger.sortAgg | Integer | 1 | Number of UDAFs to trigger sort-based aggregation, by default, all aggs containing udafs are converted to sort-based. |
| spark.auron.udafFallback.typedImperativeEstimatedRowSize | Integer | 256 | Estimated memory size per row for TypedImperativeAggregate functions in bytes. This estimation is used for memory planning and allocation during UDAF fallback operations. |
Partial Aggregate Skipping
| Conf Key | Type | Default Value | Description |
|---|---|---|---|
| spark.auron.partialAggSkipping.enable | Boolean | true | Enable partial aggregate skipping optimization to improve performance by skipping unnecessary partial aggregation stages when certain conditions are met. See issue #327 for detailed implementation. |
| spark.auron.partialAggSkipping.minRows | Integer | - | Minimum number of rows required to trigger partial aggregate skipping optimization. This prevents the optimization from being applied to very small datasets where it may not be beneficial. Defaults to spark.auron.batchSize * 5 |
| spark.auron.partialAggSkipping.ratio | Double | 0.9 | Threshold ratio for partial aggregate skipping optimization. When the ratio of unique keys to total rows exceeds this value, partial aggregation may be skipped to improve performance. |
| spark.auron.partialAggSkipping.skipSpill | Boolean | false | Always skip partial aggregation when spilling is triggered to prevent memory pressure. When enabled, the system will bypass partial aggregation stages if memory spilling occurs, potentially trading off some optimization for memory stability. |
