Data Auto Sharding and Auto Indexing: Difference between revisions

Latest revision as of 08:39, 26 January 2026

To understand the following explainations, you should read about Shards, RecordIDs, Main Storage, Delta Storage first. An understanding of Scan can also help.

Auto-Indexing

Auto Indexing works the following way:

Whenever there is a scan over a table shard that has certain boundaries (e.g. WHERE col1 = value1 AND col2 = value2 ORDER BY col3 ASC), there exists a "desired index" (e.g. [col1 col2 col3])
Every desired index is created shard-locally but marked "inactive"
There is a cost model consisting of three components:
- The cost of scanning without index
- The cost of building the index
- The cost of scanning with the index
After about two indexed scans, the cost of indexing will be ammortized
- before the amortization threshold is met, do full scans
- as soon as the amortization threshold is met, build the index

Indexes are (currently) only built over the main storage. Delta storage items are currently appended to the scan in an unordered way.

An index is a integer-compressed list of recordIds, meaning: instead of storing the values in the index itself, the index is a cache-compressed pointerless structure. This list is sorted by the sort criterion. This makes storing indexes so cheap.

A 60,000 item index would compress to 16 bits integer width, so the size of the index would be 120 KB.

Auto-Sharding

Auto-Sharding is done similar to Auto-Indexing:

Whenever there is a scan over a table where a sharding scheme along a column col would be benefitial, the "partitioning score" for col is increased.
After a while (~15min), there is a reevaluation if the shards of a table should be repartitioned
The shard dimensions are chosen proportional to the "partitioning score" from the previous steps
The number of shard dimensions can be limited in the Settings
Then, data is reshuffled. This may take a while for bigger tables.

Partitioning has the following positive effects:

A scan with boundaries matching the partitioning scheme will only touch those partitions with relevant data in it
INSERTs with unique keys will be able to scale whenever all unique keys are covered within the partitioning schema

By using insert or scan very often, it will increase the partitioning score on the right dimensions, so auto-sharding in MemCP is self learning.

Parallel Sharding

MemCP automatically creates parallel shards for optimal query performance across multiple CPU cores.

Bulk Insert Optimization

When inserting large amounts of data, MemCP automatically:

Splits bulk inserts into chunks of 60,000 rows (configurable via ShardSize)
Creates new shards on-the-fly when the current shard fills up
Rebuilds full shards in parallel using background goroutines

This enables insertion of millions of rows while maintaining parallel query execution.

Automatic Repartitioning

During rebuild(), if no partitioning hints exist but data exceeds ShardSize:*MemCP calculates the optimal number of shards based on data size

Creates at least 2 × NumCPU shards for good parallelism
Uses the first column for round-robin distribution

Performance Characteristics

With parallel sharding enabled: - Query CPU utilization: 1500-1900% (15-19 cores on a 24-core machine) - Speedup: 3-6x compared to single-threaded execution - Throughput: ~0.04 µs/row for COUNT/SUM operations

Configuration

Setting	Default	Description
ShardSize	60,000	Rows per shard before splitting
PartitionMaxDimensions	10	Maximum partitioning dimensions

@@ Line 1: / Line 1: @@
+To understand the following explainations, you should read about [[Shards, RecordIDs, Main Storage, Delta Storage]] first. An understanding of [[Scan]] can also help.
 == Auto-Indexing ==
+Auto Indexing works the following way:
+* Whenever there is a [[scan]] over a table shard that has certain boundaries (e.g. <code>WHERE col1 = value1 AND col2 = value2 ORDER BY col3 ASC</code>), there exists a "'''desired index'''" (e.g. [col1 col2 col3])
+* Every desired index is created shard-locally but marked "inactive"
+* There is a cost model consisting of three components:
+** The cost of scanning without index
+** The cost of building the index
+** The cost of scanning with the index
+* After about two indexed scans, the cost of indexing will be ammortized
+** before the amortization threshold is met, do full scans
+** as soon as the amortization threshold is met, build the index
+Indexes are (currently) only built over the main storage. Delta storage items are currently appended to the scan in an unordered way.
+An index is a [[Integer Compression|integer-compressed]] list of recordIds, meaning: instead of storing the values in the index itself, the index is a cache-compressed pointerless structure. This list is sorted by the sort criterion. This makes storing indexes so cheap.
+A 60,000 item index would compress to 16 bits integer width, so the size of the index would be 120 KB.
 == Auto-Sharding ==
+Auto-Sharding is done similar to Auto-Indexing:
+* Whenever there is a [[scan]] over a table where a sharding scheme along a column <code>col</code> would be benefitial, the "partitioning score" for <code>col</code> is increased.
+* After a while (~15min), there is a reevaluation if the shards of a table should be repartitioned
+* The shard dimensions are chosen proportional to the "partitioning score" from the previous steps
+* The number of shard dimensions can be limited in the [[Settings]]
+* Then, data is reshuffled. This may take a while for bigger tables.
+Partitioning has the following positive effects:
+* A scan with boundaries matching the partitioning scheme will only touch those partitions with relevant data in it
+* INSERTs with unique keys will be able to scale whenever all unique keys are covered within the partitioning schema
+By using insert or scan very often, it will increase the partitioning score on the right dimensions, so auto-sharding in MemCP is self learning.
+== Parallel Sharding==
+MemCP automatically creates parallel shards for optimal query performance across multiple CPU cores.
+===Bulk Insert Optimization===
+When inserting large amounts of data, MemCP automatically:
+#'''Splits bulk inserts into chunks''' of 60,000 rows (configurable via ShardSize)
+#'''Creates new shards on-the-fly''' when the current shard fills up
+#'''Rebuilds full shards in parallel''' using background goroutines
+This enables insertion of millions of rows while maintaining parallel query execution.
+===Automatic Repartitioning===
+During <code>rebuild()</code>, if no partitioning hints exist but data exceeds ShardSize:*MemCP calculates the optimal number of shards based on data size
+*Creates at least <code>2 × NumCPU</code> shards for good parallelism
+*Uses the first column for round-robin distribution
+===Performance Characteristics===
+With parallel sharding enabled: - Query CPU utilization: 1500-1900% (15-19 cores on a 24-core machine) - Speedup: 3-6x compared to single-threaded execution - Throughput: ~0.04 µs/row for COUNT/SUM operations
+===Configuration===
+{| class="wikitable"
+!Setting
+! Default
+!Description
+|-
+|ShardSize
+|60,000
+| Rows per shard before splitting
+|-
+|PartitionMaxDimensions
+|10
+|Maximum partitioning dimensions
+|}

Data Auto Sharding and Auto Indexing: Difference between revisions

Latest revision as of 08:39, 26 January 2026

Contents

Auto-Indexing

Auto-Sharding

Parallel Sharding

Bulk Insert Optimization

Automatic Repartitioning

Performance Characteristics

Configuration

Navigation menu

Data Auto Sharding and Auto Indexing: Difference between revisions

Latest revision as of 08:39, 26 January 2026

Auto-Indexing

Auto-Sharding

Parallel Sharding

Bulk Insert Optimization

Automatic Repartitioning

Performance Characteristics

Configuration

Navigation menu

Search