Data Auto Sharding and Auto Indexing

From MemCP
Revision as of 12:25, 20 May 2024 by Carli (talk | contribs) (→‎Auto-Indexing)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

To understand the following explainations, you should read about Shards, RecordIDs, Main Storage, Delta Storage first. An understanding of Scan can also help.

Auto-Indexing

Auto Indexing works the following way:

  • Whenever there is a scan over a table shard that has certain boundaries (e.g. WHERE col1 = value1 AND col2 = value2 ORDER BY col3 ASC), there exists a "desired index" (e.g. [col1 col2 col3])
  • Every desired index is created shard-locally but marked "inactive"
  • There is a cost model consisting of three components:
    • The cost of scanning without index
    • The cost of building the index
    • The cost of scanning with the index
  • After about two indexed scans, the cost of indexing will be ammortized
    • before the amortization threshold is met, do full scans
    • as soon as the amortization threshold is met, build the index

Indexes are (currently) only built over the main storage. Delta storage items are currently appended to the scan in an unordered way.

An index is a integer-compressed list of recordIds, meaning: instead of storing the values in the index itself, the index is a cache-compressed pointerless structure. This list is sorted by the sort criterion. This makes storing indexes so cheap.

A 60,000 item index would compress to 16 bits integer width, so the size of the index would be 120 KB.

Auto-Sharding

Auto-Sharding is done similar to Auto-Indexing:

  • Whenever there is a scan over a table where a sharding scheme along a column col would be benefitial, the "partitioning score" for col is increased.
  • After a while (~15min), there is a reevaluation if the shards of a table should be repartitioned
  • The shard dimensions are chosen proportional to the "partitioning score" from the previous steps
  • The number of shard dimensions can be limited in the Settings
  • Then, data is reshuffled. This may take a while for bigger tables.

Partitioning has the following positive effects:

  • A scan with boundaries matching the partitioning scheme will only touch those partitions with relevant data in it
  • INSERTs with unique keys will be able to scale whenever all unique keys are covered within the partitioning schema

By using insert or scan very often, it will increase the partitioning score on the right dimensions, so auto-sharding in MemCP is self learning.