Data Auto Sharding and Auto Indexing
To understand the following explainations, you should read about Shards, RecordIDs, Main Storage, Delta Storage first. An understanding of Scan can also help.
Auto-Indexing
Auto Indexing works the following way:
- Whenever there is a scan over a table shard that has certain boundaries (e.g.
WHERE col1 = value1 AND col2 = value2 ORDER BY col3 ASC
), there exists a "desired index" (e.g. [col1 col2 col3]) - Every desired index is created shard-locally but marked "inactive"
- There is a cost model consisting of three components:
- The cost of scanning without index
- The cost of building the index
- The cost of scanning with the index
- After about two indexed scans, the cost of indexing will be ammortized
- before the amortization threshold is met, do full scans
- as soon as the amortization threshold is met, build the index
Indexes are (currently) only built over the main storage. Delta storage items are currently appended to the scan in an unordered way.
An index is a integer-compressed list of recordIds, meaning: instead of storing the values in the index itself, the index is a cache-compressed pointerless structure. This list is sorted by the sort criterion. This makes storing indexes so cheap.
A 60,000 item index would compress to 16 bits integer width, so the size of the index would be 120 KB.
Auto-Sharding
Auto-Sharding is done similar to Auto-Indexing:
- Whenever there is a scan over a table where a sharding scheme along a column
col
would be benefitial, the "partitioning score" forcol
is increased. - After a while (~15min), there is a reevaluation if the shards of a table should be repartitioned
- The shard dimensions are chosen proportional to the "partitioning score" from the previous steps
- The number of shard dimensions can be limited in the Settings
- Then, data is reshuffled. This may take a while for bigger tables.
Partitioning has the following positive effects:
- A scan with boundaries matching the partitioning scheme will only touch those partitions with relevant data in it
- INSERTs with unique keys will be able to scale whenever all unique keys are covered within the partitioning schema
By using insert or scan very often, it will increase the partitioning score on the right dimensions, so auto-sharding in MemCP is self learning.