Optimizing Delta Sharing performance requires understanding data transfer patterns, file formats, and query execution strategies. This guide covers techniques to maximize throughput and minimize latency when sharing large datasets.
Large Dataset Best Practices
When sharing large datasets (hundreds of GBs to TBs), implement these optimization strategies:
Partition Your Data
Partitioning significantly improves query performance by enabling partition pruning:
- Reduced data scanning - Only relevant partitions are read
- Faster query execution - Fewer files to process
- Lower network transfer - Less data moved over the network
- Cost savings - Reduced compute and data transfer costs
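The pruning mechanics can be sketched without a server: with Hive-style partition paths, a filter on the partition column selects only the matching directories, so non-matching files are never listed or transferred. The file paths and the `prune` helper below are illustrative stand-ins, not part of any client API.

```python
# Hypothetical Hive-style partitioned layout for a shared table.
files = [
    "sales/date=2024-01-01/part-000.parquet",
    "sales/date=2024-01-02/part-000.parquet",
    "sales/date=2024-02-01/part-000.parquet",
]

def prune(files, column, value):
    """Keep only files whose path carries the matching partition value."""
    token = f"{column}={value}"
    return [f for f in files if token in f.split("/")]

# A filter on the partition column touches one directory out of three;
# the other files are never downloaded.
pruned = prune(files, "date", "2024-01-01")
```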
File Size Optimization
Optimal file sizes balance parallelism and overhead:
Recommended File Sizes:
- Minimum: 128 MB - Avoids excessive file listing overhead
- Optimal: 256 MB to 1 GB - Good balance for most workloads
- Maximum: 2 GB - Prevents memory issues in readers
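The arithmetic behind these recommendations is straightforward; a quick sanity check of how many files a table should produce (the figures here are examples, not requirements):

```python
# Rough sizing arithmetic: a 512 GB table at a 512 MB average file size
# yields ~1024 files - ample parallelism without excessive listing overhead.
dataset_gb = 512
target_file_mb = 512
file_count = (dataset_gb * 1024) // target_file_mb  # -> 1024 files

# The same table in 1 MB files would mean over half a million file
# listings per query - the overhead the 128 MB floor is meant to avoid.
tiny_file_count = (dataset_gb * 1024) // 1
```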
Limit Hint Usage
Use limitHint to reduce data transfer when you only need a subset:
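A minimal sketch of a table-query request body carrying limitHint, following the Delta Sharing REST protocol. Note that servers treat limitHint as best-effort: they may return more rows than requested, so clients still apply the final limit themselves. (The predicate string is an example.)

```python
import json

# Query request body for the table-query endpoint: ask the server to
# return roughly the first 1000 matching rows instead of the full table.
request_body = {
    "predicateHints": ["date >= '2024-01-01'"],
    "limitHint": 1000,
}
payload = json.dumps(request_body)
```

With the Python connector, the equivalent is the `limit` parameter of `delta_sharing.load_as_pandas`.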
Delta Format vs Parquet Format
Delta Sharing supports two response formats with different performance characteristics:
Response Format Comparison
Parquet Format (Default)
When to Use:
- Legacy clients (delta-sharing-spark < 3.1)
- Simple tables without advanced Delta features
- Maximizing client compatibility
Response Structure:
- Compatible with all Delta Sharing clients
- Server converts Delta metadata to simplified format
- Limited to basic Delta features (minReaderVersion = 1)
- No support for Deletion Vectors or Column Mapping
Delta Format (Advanced)
When to Use:
- Modern clients (delta-sharing-spark >= 3.1)
- Tables with advanced Delta features
- Need for Deletion Vectors or Column Mapping
- Performance optimization through native Delta reading
Response Structure:
- Supports advanced Delta features (minReaderVersion > 1)
- Native Delta log structure preserved
- Better performance with Delta Kernel
- Deletion Vectors for efficient updates
- Column Mapping for schema evolution
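Clients opt into the Delta format through the delta-sharing-capabilities request header defined by the protocol; without it, servers default to the Parquet format. A sketch of the header and of parsing it back apart (the reader-feature names listed are illustrative):

```python
# Header a client sends to request the Delta response format and to
# advertise which reader features it can handle.
headers = {
    "delta-sharing-capabilities": (
        "responseformat=delta;readerfeatures=deletionvectors,columnmapping"
    ),
}

# Capabilities are semicolon-separated key=value pairs.
capabilities = dict(
    item.split("=", 1)
    for item in headers["delta-sharing-capabilities"].split(";")
)
```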
Format Selection Strategy
Performance Impact
Delta format can provide 2-5x better query performance for tables with:
- Deletion Vectors (updates/deletes)
- Column Mapping (schema evolution)
- Large partition counts
Batch Conversion for Memory Optimization
For large tables that don’t fit in memory, use batch conversion:
Memory-Efficient Data Loading
- Fetches file metadata from server
- Downloads and converts Parquet files in smaller batches
- Concatenates results incrementally
- Reduces peak memory consumption
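The batching idea can be sketched independently of the client library: walk the file list in fixed-size batches and fold results incrementally instead of materializing every intermediate table at once (the Python connector's convert_in_batches option follows this pattern; the `convert` callable below is a stand-in for download-plus-parse):

```python
def iter_batches(files, batch_size):
    """Yield the file list in fixed-size batches."""
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]

def load_in_batches(files, convert, batch_size=4):
    """Convert a few files at a time and append results incrementally,
    keeping peak memory near one batch plus the accumulated output."""
    result = []
    for batch in iter_batches(files, batch_size):
        for f in batch:
            result.extend(convert(f))  # convert() stands in for download+parse
    return result

# Stand-in converter: pretend each "file" yields two rows.
rows = load_in_batches(
    [f"part-{i}" for i in range(10)],
    lambda f: [f + ":a", f + ":b"],
)
```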
| Method | Peak Memory | Best For |
|---|---|---|
| Standard | ~2x table size | Tables < 10 GB |
| Batch Conversion | ~1.2x table size | Tables > 10 GB |
| Streaming (Spark) | ~100 MB per partition | Tables > 100 GB |
Spark Streaming for Continuous Data
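For continuously updated shared tables, the delta-sharing-spark connector supports Structured Streaming. A hedged sketch of the reader options such a job would pass, assembled as a plain dict here so it runs without a Spark session; the share/schema/table coordinates and exact option values are placeholders:

```python
# These options would be applied as:
#   stream = (spark.readStream.format("deltaSharing")
#             .options(**stream_options)
#             .load("profile.share#share.schema.table"))
stream_options = {
    "startingVersion": "0",      # begin from a known table version
    "maxVersionsPerRpc": "10",   # cap versions fetched per server call
    "ignoreDeletes": "true",     # skip delete-only commits, if acceptable
}
```

Capping maxVersionsPerRpc keeps individual server calls small and predictable, which matters when the source table commits frequently.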
Predicate Pushdown Optimization
Predicate pushdown filters data at the source, reducing network transfer:
SQL Predicate Hints
| Operator | Example | Use Case |
|---|---|---|
| = | col = 123 | Exact match |
| >, < | col > 1000 | Range queries |
| >=, <= | col >= '2024-01-01' | Inclusive ranges |
| <> | col <> 'US' | Exclusion |
| IS NULL | col IS NULL | Null checks |
| IS NOT NULL | col IS NOT NULL | Non-null checks |
JSON Predicate Hints (Recommended)
JSON predicates provide structured, type-safe filtering:
- Type-safe filtering with explicit valueType
- Complex boolean logic (and, or, not)
- Easier for servers to parse and optimize
- Preferred over SQL predicateHints
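A jsonPredicateHints tree following the Delta Sharing protocol: column and literal leaf ops combined under boolean ops, each leaf carrying an explicit valueType so the server can filter without guessing types. The column names and values are examples:

```python
import json

# Equivalent to: date >= '2024-01-01' AND country = 'US'
predicate = {
    "op": "and",
    "children": [
        {
            "op": "greaterThanOrEqual",
            "children": [
                {"op": "column", "name": "date", "valueType": "date"},
                {"op": "literal", "value": "2024-01-01", "valueType": "date"},
            ],
        },
        {
            "op": "equal",
            "children": [
                {"op": "column", "name": "country", "valueType": "string"},
                {"op": "literal", "value": "US", "valueType": "string"},
            ],
        },
    ],
}
json_predicate_hints = json.dumps(predicate)
```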
Version and Time Travel Performance
Querying historical versions efficiently:
Version-Based Queries
- Version queries: Fast - direct metadata lookup
- Timestamp queries: Slower - requires timestamp-to-version mapping
- Latest version: Fastest - no additional lookups
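The difference shows up in the query request body: a pinned version is a direct metadata lookup, while a timestamp must first be resolved to a version on the server. A minimal sketch of the two request shapes, with field names per the protocol and example values:

```python
# Fast path: direct version lookup.
by_version = {"version": 42}

# Slower path: the server resolves the timestamp to a version first.
by_timestamp = {"timestamp": "2024-01-01T00:00:00Z"}
```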
Change Data Feed (CDF) Performance
- Use ending_version to limit query window
- Enable convert_in_batches for large change sets
- Query smaller version ranges for faster results
- Use Delta format for tables with frequent updates
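The version window maps directly onto the table-changes endpoint's query parameters, so a smaller window bounds both server-side work and transfer size. A sketch of the parameters (version numbers are examples):

```python
from urllib.parse import urlencode

# Request changes for a bounded window of 10 versions rather than the
# table's full change history.
params = {"startingVersion": 100, "endingVersion": 110}
query_string = urlencode(params)
```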
Caching Strategies
Implement caching to reduce repeated data transfers:
Client-Side Caching
File-Level Caching
Delta Sharing uses the file id for consistent file identification:
- Cache downloaded Parquet files by id
- Check expirationTimestamp before reusing URLs
- Implement LRU eviction for cache size management
- Use local SSD for cache storage
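The three rules above combine into a small cache: key by the stable file id, honor expirationTimestamp, and evict least-recently-used entries. A minimal sketch (the class name and sizes are illustrative, not part of any connector API):

```python
import time
from collections import OrderedDict

class FileUrlCache:
    """LRU cache keyed by file id; entries carry the pre-signed URL
    and its expirationTimestamp (epoch millis)."""

    def __init__(self, max_entries=256):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # file id -> (url, expiration_ms)

    def put(self, file_id, url, expiration_ms):
        self._entries[file_id] = (url, expiration_ms)
        self._entries.move_to_end(file_id)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, file_id, now_ms=None):
        now_ms = time.time() * 1000 if now_ms is None else now_ms
        entry = self._entries.get(file_id)
        if entry is None:
            return None
        url, expiration_ms = entry
        if now_ms >= expiration_ms:  # expired: must re-request the URL
            del self._entries[file_id]
            return None
        self._entries.move_to_end(file_id)  # refresh LRU position
        return url
```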
Network and Connection Optimization
Connection Pooling
Parallel Downloads
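Delta Sharing responses list many independent pre-signed file URLs, so downloads parallelize cleanly. The sketch below uses a thread pool with a stubbed `fetch`; in a real client, `fetch` would be an HTTP GET issued through one pooled session (e.g. a shared requests.Session) reused across workers rather than reconnecting per file:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for the actual HTTP transfer of one pre-signed URL."""
    return f"bytes-of:{url}"

def download_all(urls, max_workers=8):
    """Download files concurrently; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

results = download_all([f"https://example/part-{i}" for i in range(4)])
```

Sizing max_workers to the available bandwidth matters more than raw count: past saturation, extra workers only add contention.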
Monitoring and Profiling
Track performance metrics to identify bottlenecks:
Query Timing
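A minimal timing helper for seeing where a query actually spends its time; wrap each phase (metadata fetch, file download, conversion) separately. The phase labels and sleeps below are placeholders for real client calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of a phase under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

timings = {}
with timed("metadata", timings):
    time.sleep(0.01)  # stand-in for the metadata request
with timed("download", timings):
    time.sleep(0.02)  # stand-in for file transfers
```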
Server-Side Metrics
Monitor Delta Sharing server performance:
Key Metrics:
- Request latency (p50, p95, p99)
- Requests per second
- Error rates (401, 403, 500)
- Data transfer volume
- Active connections
Performance Tuning Checklist
Optimization Checklist
Data Organization:
- Partition tables by common query dimensions
- Maintain optimal file sizes (256 MB - 1 GB)
- Use Z-ordering for multi-dimensional queries (Delta only)
- Regularly compact small files
Query Optimization:
- Use predicateHints or jsonPredicateHints for filtering
- Set appropriate limitHint values
- Enable Delta format for advanced features
- Use batch conversion for large tables
Network:
- Implement connection pooling
- Use parallel downloads for multiple files
- Cache frequently accessed data
- Monitor and optimize bandwidth usage
Streaming:
- Configure appropriate check intervals
- Limit versions per RPC (10-20)
- Use checkpoints for fault tolerance
- Monitor consumer lag
Monitoring:
- Track query latencies
- Monitor data transfer volumes
- Set up alerts for performance degradation
- Profile slow queries regularly