Scan

Scan is the most important function in all of MemCP. It implements an optimized, indexed, and parallelized for loop over the items of a table.

It represents a data-local access operation: each shard can run on a different CPU core, fully exploiting cache locality inside the shard. The results of the shard-local scans are then combined into a global result.
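
For intuition, here is a worked illustration (not actual MemCP output) of how shard-local results are combined, using addition as the combining (reduce) function and 0 as its neutral element. Assume a column is split across two shards holding the values (2 3) and (4 1):

/* shard-local pass:   shard A: (+ (+ 0 2) 3) => 5,   shard B: (+ (+ 0 4) 1) => 5 */
/* shard-collect pass: (+ (+ 0 5) 5) => 10 */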

There are two variants of scan: scan and scan_order.

Unordered parallel scan + reduce: (scan schema table filterColumns filter mapColumns map reduce neutral reduce2 isOuter)

Help for: scan
===

does an unordered parallel filter-map-reduce pass on a single table and returns the reduced result

Allowed number of parameters: 6 - 10

 - schema (string|nil): database where the table is located
 - table (string|list): name of the table to scan (or a list if you have temporary data)
 - filterColumns (list): list of columns that are fed into filter
 - filter (func): lambda function that decides whether a dataset is passed to the map phase. You can use any column of that table as a lambda parameter. You should structure your lambda with an (and) at the root element. Every equal?, <, >, <=, >= will possibly be translated into an indexed scan
 - mapColumns (list): list of columns that are fed into map
 - map (func): lambda function to extract data from the dataset. You can use any column of that table as a lambda parameter. You can return a value you want to extract and pass to reduce, but you can also directly call the insert, print or resultrow functions. If you declare a parameter named '$update', this variable will hold a function that you can use to delete or update a row. Call ($update) to delete the dataset, call ($update '("field1" value1 "field2" value2)) to update certain columns (see the sketch after this parameter list).
 - reduce (func): (optional) lambda function to aggregate the map results. It takes two parameters (a b) where a is the accumulator and b the new value. The accumulator for the first reduce call is the neutral element. The return value will be the accumulator input for the next reduce call. There are two reduce phases: shard-local and shard-collect. In the shard-local phase, a starts with neutral and b is fed with the return values of each map call. In the shard-collect phase, a starts with neutral and b is fed with the result of each shard-local pass.
 - neutral (any): (optional) neutral element for the reduce phase, otherwise nil is assumed
 - reduce2 (func): (optional) second-stage reduce function that applies the result of reduce to the neutral element/accumulator
 - isOuter (bool): (optional) if true, in case of no hits, call map once anyway with NULL values
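
The following sketch (untested) uses the $update parameter described above to update or delete matching rows, assuming a table tbl1(k, v) as in the examples below and that "$update" is requested through mapColumns just like a regular column:

/* set v to "new" for the row with k = "12" */
(scan "schema" "tbl1" '(k) (lambda (k) (equal? k "12")) '("v" "$update") (lambda (v $update) ($update '("v" "new"))))

/* delete all rows whose v equals "obsolete" */
(scan "schema" "tbl1" '(v) (lambda (v) (equal? v "obsolete")) '("$update") (lambda ($update) ($update)))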

Characteristics

  • Filter phase is parallel and uses indexes
  • Map phase is parallel
  • 1st Reduce phase is parallel
  • 2nd Reduce phase is sequential (but it collects at most as many items as there are shards)

Examples

The following example prints all key-value pairs in tbl1(k, v):

(scan "schema" "tbl1" '() (lambda () true) '("k" "v") (lambda (k v) (print "tbl1[" k "] = " v)))

The following example finds a value for a key:

(scan "schema" "tbl1" '(k) (lambda (k) (equal? k "12")) '("v") (lambda (v) (print "tbl1[12] = " v)))
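
As noted in the filter description, an (and) at the root of the filter lets comparisons such as >= and < possibly be translated into indexed scans. A sketch of a range filter (untested, using tbl2(weight) from the next example):

(scan "schema" "tbl2" '(weight) (lambda (weight) (and (>= weight 10) (< weight 20))) '("weight") (lambda (weight) (print "weight = " weight)))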

The following example adds all values in tbl2(weight):

(scan "schema" "tbl2" '() (lambda () true) '("weight") + 0)

/* equivalent to: */ (scan "schema" "tbl2" '() (lambda () true) '("weight") (lambda (a b) (+ a b)) 0)
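
The optional reduce2 parameter can be illustrated with a row count (untested sketch, following the two reduce phases described above): reduce increments the per-shard accumulator once per mapped row, while reduce2 adds the per-shard counts together in the shard-collect phase.

(scan "schema" "tbl1" '() (lambda () true) '("k") (lambda (k) k) (lambda (a b) (+ a 1)) 0 (lambda (a b) (+ a b)))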

Ordered scan+reduce: (scan_order schema table filterColumns filter sortcols sortdirs offset limit mapColumns map reduce neutral isOuter)

Help for: scan_order
===

does an ordered parallel filter and serial map-reduce pass on a single table and returns the reduced result

Allowed number of parameters: 10 - 13

 - schema (string): database where the table is located
 - table (string): name of the table to scan
 - filterColumns (list): list of columns that are fed into filter
 - filter (func): lambda function that decides whether a dataset is passed to the map phase. You can use any column of that table as a lambda parameter. You should structure your lambda with an (and) at the root element. Every equal?, <, >, <=, >= will possibly be translated into an indexed scan
 - sortcols (list): list of columns to sort. Each column is either a string to point to an existing column or a func(cols...)->any to compute a sortable value
 - sortdirs (list): list of column directions to sort. Must be same length as sortcols. false means ASC, true means DESC
 - offset (number): number of items to skip before the first one is fed into map
 - limit (number): max number of items to read
 - mapColumns (list): list of columns that are fed into map
 - map (func): lambda function to extract data from the dataset. You can use any column of that table as a lambda parameter. You can return a value you want to extract and pass to reduce, but you can also directly call the insert, print or resultrow functions. If you declare a parameter named '$update', this variable will hold a function that you can use to delete or update a row. Call ($update) to delete the dataset, call ($update '("field1" value1 "field2" value2)) to update certain columns.
 - reduce (func): (optional) lambda function to aggregate the map results. It takes two parameters (a b) where a is the accumulator and b the new value. The accumulator for the first reduce call is the neutral element. The return value will be the accumulator input for the next reduce call. There are two reduce phases: shard-local and shard-collect. In the shard-local phase, a starts with neutral and b is fed with the return values of each map call. In the shard-collect phase, a starts with neutral and b is fed with the result of each shard-local pass.
 - neutral (any): (optional) neutral element for the reduce phase, otherwise nil is assumed
 - isOuter (bool): (optional) if true, in case of no hits, call map once anyway with NULL values

Characteristics

  • Filter phase is parallel and uses indexes
  • Sort phase is parallel
  • Map and Reduce are executed sequentially
  • Map and Reduce can be pruned by offset and limit

Examples

The following example prints all items from tbl3(id, weight) ordered by weight DESC:

(scan_order "schema" "tbl3" '() (lambda () true) '("weight") '(true) 0 -1 '("id" "weight") (lambda (id weight) (print id "," weight)))
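
The offset and limit parameters prune how many of the sorted items reach map. For example, the following sketch (untested) first prints only the three heaviest items of tbl3, and the second call skips the first ten items and prints the next ten (e.g. for pagination):

(scan_order "schema" "tbl3" '() (lambda () true) '("weight") '(true) 0 3 '("id" "weight") (lambda (id weight) (print id "," weight)))

(scan_order "schema" "tbl3" '() (lambda () true) '("weight") '(true) 10 10 '("id" "weight") (lambda (id weight) (print id "," weight)))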