Configurations for Auron

Runtime Configuration

| Conf Key | Type | Default Value | Description |
|----------|------|---------------|-------------|
| `spark.auron.enableInputBatchStatistics` | Boolean | true | Enable collection of additional metrics for input batch statistics to monitor data processing performance. |
| `spark.auron.enabled` (alternative: `spark.auron.enable`) | Boolean | true | Enable Spark Auron support to accelerate query execution with native implementations. |
| `spark.auron.onHeapSpill.memoryFraction` | Double | 0.9 | Maximum memory fraction allocated for on-heap spilling operations. Controls what portion of the available on-heap memory can be used for spilling data to disk under memory pressure. |
| `spark.auron.process.vmrss.memoryFraction` | Double | 0.9 | Suggested fraction of total process memory (on-heap and off-heap) to use for resident memory. Controls the limit on the process's virtual memory resident set size (VMRSS). |
| `spark.auron.shuffle.compression.targetBufSize` | Integer | 4194304 | Target buffer size in bytes for shuffle compression, affecting both compression efficiency and memory usage. Default is 4 MB (4,194,304 bytes). |
| `spark.auron.spill.compression.codec` | String | lz4 | Compression codec used when spill data is written to disk under memory pressure. Common options include lz4, snappy, and gzip. The choice affects both spill performance and disk space usage. |
| `spark.auron.suggested.batch.memSize` | Integer | 8388608 | Suggested memory size in bytes for record batches. Controls the target memory allocation for individual data batches to optimize memory usage and processing efficiency. Default is 8 MB (8,388,608 bytes). |
| `spark.auron.suggested.batch.memSize.multiwayMerging` | Integer | 1048576 | Suggested memory size in bytes for k-way merging operations. Uses a smaller batch memory size than regular operations, since multiple batches are kept in memory simultaneously during k-way merging. Default is 1 MB (1,048,576 bytes). |
| `spark.auron.tokio.worker.threads.per.cpu` | Integer | 0 | Number of Tokio worker threads to create per CPU core (`spark.task.cpus`). Set to 0 for automatic detection based on available CPU cores. Controls the thread-pool size for Tokio-based asynchronous operations. |
| `spark.auron.ui.enabled` (alternative: `spark.auron.ui.enable`) | Boolean | true | Enable Spark Auron UI support to display Auron-specific metrics and statistics in the Spark UI. |
| `spark.io.compression.codec` | String | - | Compression codec used for Spark I/O operations. Common options include lz4, snappy, gzip, and zstd. The choice of codec affects both compression ratio and decompression speed. |
| `spark.io.compression.zstd.level` | Integer | - | Compression level for the Zstandard (zstd) codec used in Spark I/O operations. Valid values range from 1 (fastest) to 22 (highest compression). Higher levels provide better compression but require more CPU time and memory. |
| `spark.task.cpus` | Integer | - | Number of CPU cores allocated per Spark task. Auron reads this standard Spark setting to determine per-task parallelism and thread-pool sizing. |
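All of the keys above are ordinary Spark configuration entries, so they can be set in `spark-defaults.conf`, on a `SparkConf`, or as `--conf` flags to `spark-submit`. A minimal sketch of assembling such flags, where the helper name and the chosen values are illustrative rather than part of Auron:

```python
# Sketch: render Auron runtime settings as spark-submit --conf arguments.
# The helper name and the example values are illustrative only.

def auron_submit_flags(conf: dict[str, str]) -> list[str]:
    """Render a conf dict as a flat list of spark-submit arguments."""
    args: list[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

conf = {
    "spark.auron.enabled": "true",
    # Ask for 16 MB record batches instead of the 8 MB default:
    "spark.auron.suggested.batch.memSize": str(16 * 1024 * 1024),
    "spark.auron.spill.compression.codec": "lz4",
}

flags = auron_submit_flags(conf)
# flags can then be spliced into a spark-submit command line.
```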

Operator Support

| Conf Key | Type | Default Value | Description |
|----------|------|---------------|-------------|
| `spark.auron.enable.aggr` | Boolean | true | Enable AggregateExec conversion to the native Auron implementation. |
| `spark.auron.enable.bhj` | Boolean | true | Enable BroadcastHashJoinExec conversion to the native Auron implementation. |
| `spark.auron.enable.bnlj` | Boolean | true | Enable BroadcastNestedLoopJoinExec conversion to the native Auron implementation. |
| `spark.auron.enable.broadcastExchange` | Boolean | true | Enable BroadcastExchangeExec conversion to the native Auron implementation. |
| `spark.auron.enable.collectLimit` | Boolean | true | Enable CollectLimitExec conversion to the native Auron implementation. |
| `spark.auron.enable.data.writing` | Boolean | false | Enable DataWritingExec conversion to the native Auron implementation. |
| `spark.auron.enable.expand` | Boolean | true | Enable ExpandExec conversion to the native Auron implementation. |
| `spark.auron.enable.filter` | Boolean | true | Enable FilterExec conversion to the native Auron implementation. |
| `spark.auron.enable.generate` | Boolean | true | Enable GenerateExec conversion to the native Auron implementation. |
| `spark.auron.enable.global.limit` | Boolean | true | Enable GlobalLimitExec conversion to the native Auron implementation. |
| `spark.auron.enable.local.limit` | Boolean | true | Enable LocalLimitExec conversion to the native Auron implementation. |
| `spark.auron.enable.local.table.scan` | Boolean | true | Enable LocalTableScanExec conversion to the native Auron implementation. |
| `spark.auron.enable.paimon.scan` | Boolean | true | Enable PaimonScanExec conversion to the native Auron implementation. |
| `spark.auron.enable.project` | Boolean | true | Enable ProjectExec conversion to the native Auron implementation. |
| `spark.auron.enable.scan` | Boolean | true | Enable ScanExec conversion to the native Auron implementation. |
| `spark.auron.enable.shj` | Boolean | true | Enable ShuffledHashJoinExec conversion to the native Auron implementation. |
| `spark.auron.enable.shuffleExchange` | Boolean | true | Enable ShuffleExchangeExec conversion to the native Auron implementation. |
| `spark.auron.enable.smj` | Boolean | true | Enable SortMergeJoinExec conversion to the native Auron implementation. |
| `spark.auron.enable.sort` | Boolean | true | Enable SortExec conversion to the native Auron implementation. |
| `spark.auron.enable.take.ordered.and.project` | Boolean | true | Enable TakeOrderedAndProjectExec conversion to the native Auron implementation. |
| `spark.auron.enable.union` | Boolean | true | Enable UnionExec conversion to the native Auron implementation. |
| `spark.auron.enable.window` | Boolean | true | Enable WindowExec conversion to the native Auron implementation. |
| `spark.auron.enable.window.group.limit` | Boolean | true | Enable WindowGroupLimitExec conversion to the native Auron implementation. |
| `spark.auron.forceShuffledHashJoin` | Boolean | false | Force replacement of all sort-merge joins with shuffled-hash joins for performance comparison and benchmarking. Primarily used for testing and performance analysis, as different join strategies may be optimal for different data distributions and query patterns. |
| `spark.auron.smjfallback.enable` | Boolean | false | Enable fallback from hash join to sort-merge join when the hash table becomes too large to fit in memory. This prevents out-of-memory errors by switching to a more memory-efficient join strategy when necessary. |
| `spark.auron.smjfallback.mem.threshold` | Integer | 134217728 | Memory size threshold in bytes that triggers fallback from hash join to sort-merge join. When hash table memory usage exceeds this threshold (128 MB by default), the system switches to sort-merge join to prevent memory overflow. |
| `spark.auron.smjfallback.rows.threshold` | Integer | 10000000 | Row count threshold that triggers fallback from hash join to sort-merge join. When the number of rows in the hash table exceeds this threshold, the system switches to sort-merge join to avoid memory issues. |
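The two fallback thresholds above can be read as independent triggers. A sketch of that decision, assuming either threshold alone causes the switch (the function and its OR semantics are illustrative; Auron's actual logic lives in native code):

```python
# Sketch of the hash-join -> sort-merge-join fallback decision described above.
# Defaults mirror the table; the exact trigger semantics are an assumption.

SMJ_FALLBACK_MEM_THRESHOLD = 134_217_728   # spark.auron.smjfallback.mem.threshold (128 MB)
SMJ_FALLBACK_ROWS_THRESHOLD = 10_000_000   # spark.auron.smjfallback.rows.threshold

def should_fall_back_to_smj(hash_table_bytes: int, hash_table_rows: int,
                            enabled: bool = False) -> bool:
    """Fall back once either threshold is exceeded, if fallback is enabled."""
    if not enabled:   # spark.auron.smjfallback.enable defaults to false
        return False
    return (hash_table_bytes > SMJ_FALLBACK_MEM_THRESHOLD
            or hash_table_rows > SMJ_FALLBACK_ROWS_THRESHOLD)
```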

Data Sources

| Conf Key | Type | Default Value | Description |
|----------|------|---------------|-------------|
| `spark.auron.enable.scan.orc` | Boolean | true | Enable OrcScanExec conversion to the native Auron implementation. |
| `spark.auron.enable.scan.orc.timestamp` | Boolean | true | Enable OrcScanExec conversion with timestamp fields to the native Auron implementation. |
| `spark.auron.enable.scan.parquet` | Boolean | true | Enable ParquetScanExec conversion to the native Auron implementation. |
| `spark.auron.enable.scan.parquet.timestamp` | Boolean | true | Enable ParquetScanExec conversion with timestamp fields to the native Auron implementation. |
| `spark.auron.files.ignoreCorruptFiles` | Boolean | - | Ignore corrupted input files. Defaults to `spark.sql.files.ignoreCorruptFiles`. |
| `spark.auron.orc.force.positional.evolution` | Boolean | false | Force ORC positional evolution mode for schema evolution. When enabled, column mapping is based on column position rather than column name, which can be useful for certain schema evolution scenarios. |
| `spark.auron.orc.schema.caseSensitive.enable` | Boolean | false | Enable case-sensitive schema matching for ORC files. When true, column names in the schema must exactly match the case of columns in the ORC file. When false, matching is case-insensitive. |
| `spark.auron.orc.timestamp.use.microsecond` | Boolean | false | Use microsecond precision when reading ORC timestamp columns instead of the default millisecond precision. This provides higher temporal resolution for timestamp data but may require more storage space. |
| `spark.auron.parquet.enable.bloomFilter` | Boolean | false | Enable Parquet bloom filter support for efficient equality predicate filtering. Bloom filters can quickly determine whether a value might exist in a data block, reducing unnecessary I/O. |
| `spark.auron.parquet.enable.pageFiltering` | Boolean | false | Enable Parquet page-level filtering to skip reading unnecessary data pages during query execution. This optimization can significantly improve read performance by avoiding I/O for pages that don't match filter predicates. |
| `spark.auron.parquet.maxOverReadSize` | Integer | 16384 | Maximum over-read size in bytes for Parquet file operations. Controls how much extra data may be read beyond the required range to coalesce I/O operations and improve read performance. |
| `spark.auron.parquet.metadataCacheSize` | Integer | 5 | Size of the Parquet metadata cache in number of entries. This cache stores file metadata to avoid repeated metadata reads and improve query performance for frequently accessed files. |

Expression/Function Support

| Conf Key | Type | Default Value | Description |
|----------|------|---------------|-------------|
| `spark.auron.cast.trimString` | Boolean | true | Enable automatic trimming of whitespace from string inputs before casting to numeric or boolean types. This helps prevent casting errors due to leading/trailing whitespace. |
| `spark.auron.datetime.extract.enabled` | Boolean | false | Enable conversion of datetime extract operations to the native Auron implementation. |
| `spark.auron.decimal.arithOp.enabled` | Boolean | false | Enable conversion of decimal arithmetic operations to the native Auron implementation. |
| `spark.auron.enable.caseconvert.functions` | Boolean | true | Enable converting the UPPER/LOWER string functions to native implementations for better performance. Note: may produce output that differs from Spark in special cases due to different Unicode versions. |
| `spark.auron.forceShortCircuitAndOr` | Boolean | false | Force short-circuit evaluation (PhysicalSCAndExprNode/PhysicalSCOrExprNode) for AND/OR expressions, regardless of whether the right-hand side contains Hive UDFs. This can improve performance by skipping evaluation of expressions once the result is already determined. |
| `spark.auron.parseJsonError.fallback` | Boolean | true | Enable fallback to the UDFJson implementation when native JSON parsing encounters errors. This ensures query execution continues even when the native JSON parser fails, at the cost of potentially slower performance. |
| `spark.auron.udf.UDFJson.enabled` | Boolean | true | Enable UDFJson function conversion to the native Auron implementation. |
| `spark.auron.udf.brickhouse.enabled` | Boolean | true | Enable Brickhouse UDF conversion to the native Auron implementation. |
| `spark.auron.udf.singleChildFallback.enabled` | Boolean | true | Enable fallback for UDFs/expressions with a single child. |
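The short-circuit behavior that `spark.auron.forceShortCircuitAndOr` forces can be illustrated in plain Python, where `and`/`or` already short-circuit. The tracking function here is hypothetical, standing in for an expensive right-hand-side expression such as a Hive UDF:

```python
# Illustration of short-circuit AND evaluation: the right-hand side is never
# evaluated once the left-hand side has determined the result.
calls = []

def expensive_udf(x):
    """Hypothetical stand-in for a costly RHS expression; records each call."""
    calls.append(x)
    return x > 0

rows = [(-1, 5), (0, 7), (3, 2)]
# Equivalent of "a > 0 AND expensive_udf(b)": the UDF runs only when a > 0.
result = [a > 0 and expensive_udf(b) for a, b in rows]
# Only the last row (a = 3) reaches the right-hand side, so calls == [2].
```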

UDAF Fallback

| Conf Key | Type | Default Value | Description |
|----------|------|---------------|-------------|
| `spark.auron.suggested.udaf.memUsedSize` | Integer | 64 | Suggested memory usage per row for TypedImperativeAggregate functions, in bytes. This helps in memory allocation planning for UDAF operations. |
| `spark.auron.udafFallback.enable` | Boolean | true | Enable fallback support for UDAFs and other aggregate functions that are not implemented in Auron, allowing them to be executed using Spark's native implementation. |
| `spark.auron.udafFallback.num.udafs.trigger.sortAgg` | Integer | 1 | Number of UDAFs required to trigger sort-based aggregation. By default, all aggregations containing UDAFs are converted to sort-based aggregation. |
| `spark.auron.udafFallback.typedImperativeEstimatedRowSize` | Integer | 256 | Estimated memory size per row for TypedImperativeAggregate functions, in bytes. This estimate is used for memory planning and allocation during UDAF fallback operations. |
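The two sizing keys above feed per-row memory planning. A sketch of the implied arithmetic, where the function name is illustrative rather than Auron's actual allocator:

```python
# Sketch of UDAF-fallback memory planning from the per-row size estimates above.

SUGGESTED_UDAF_MEM_USED = 64      # spark.auron.suggested.udaf.memUsedSize (bytes/row)
TYPED_IMPERATIVE_ROW_SIZE = 256   # spark.auron.udafFallback.typedImperativeEstimatedRowSize

def estimated_udaf_memory(num_rows: int,
                          row_size: int = TYPED_IMPERATIVE_ROW_SIZE) -> int:
    """Planned bytes for a TypedImperativeAggregate buffer of num_rows rows."""
    return num_rows * row_size

# One million grouped rows at the 256-byte default estimate -> 256 MB planned.
```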

Partial Aggregate Skipping

| Conf Key | Type | Default Value | Description |
|----------|------|---------------|-------------|
| `spark.auron.partialAggSkipping.enable` | Boolean | true | Enable the partial aggregate skipping optimization, which improves performance by skipping unnecessary partial aggregation stages when certain conditions are met. See issue #327 for detailed implementation. |
| `spark.auron.partialAggSkipping.minRows` | Integer | - | Minimum number of rows required to trigger partial aggregate skipping. This prevents the optimization from being applied to very small datasets where it may not be beneficial. Defaults to `spark.auron.batchSize` * 5. |
| `spark.auron.partialAggSkipping.ratio` | Double | 0.9 | Threshold ratio for partial aggregate skipping. When the ratio of unique keys to total rows exceeds this value, partial aggregation may be skipped to improve performance. |
| `spark.auron.partialAggSkipping.skipSpill` | Boolean | false | Always skip partial aggregation when spilling is triggered, to prevent memory pressure. When enabled, the system bypasses partial aggregation stages if memory spilling occurs, potentially trading some optimization for memory stability. |
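The skipping condition described above can be sketched as a predicate over the rows seen so far. This is an illustrative reading of the two thresholds, not Auron's exact logic, and the `spark.auron.batchSize` placeholder value below is an assumption:

```python
# Sketch of the partial-aggregate-skipping decision: skip once enough rows have
# been observed and partial aggregation would barely reduce the row count.

PARTIAL_AGG_SKIPPING_RATIO = 0.9   # spark.auron.partialAggSkipping.ratio
BATCH_SIZE = 10_000                # placeholder value for spark.auron.batchSize
MIN_ROWS = BATCH_SIZE * 5          # default of spark.auron.partialAggSkipping.minRows

def should_skip_partial_agg(total_rows: int, unique_keys: int) -> bool:
    """True when the unique-key ratio exceeds the threshold on enough rows."""
    if total_rows < MIN_ROWS:
        return False
    return unique_keys / total_rows > PARTIAL_AGG_SKIPPING_RATIO
```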