# Maru Extract-Map-Reduce

#### General Architecture

In the first iteration of Maru network proof computation is divided between several nodes. Each node runs independently and may be run on a single physical node, several physical nodes or share physical nodes depending on particular proof complexity. Communications between nodes are handled by proof streaming service. We use our in-house solution called Proxima Streams, which is capable not only of on-line delivery of streaming data, but also has historical streams. It means that each stream can be replayed from any point, making introduction of brand new proof types easy. Also our streaming system is designed to handle blockchain-based data, so have internal chain fork handling. Later iterations would add more flexibility in stream rerouting and reward system for nodes, making the whole network essentially a marketplace of zk-proofs. We design our system in a way to make it as open as possible, offloading actual computation costs on third parties and incentifying them with monetary rewards. It makes the system as decentralized as possible, as it does not matter who calculated a proof, only that the proof is correct. There is a special case for indexing, but we are striving to improve protocol to the point where even indexing could be done by a third party.

#### Preprocessing

This step is common for all data streams and shared between them. For each new block in Ethereum we will create proof for all receipts for all transactions in the block. Goal of this proof is to encapsulate all the STARK-unfriendly hashing in one place. For all receipts we will restore PMT up to the root and prove every hashing and hash inclusion to create proof of existence. Additionally for each receipt Poseidon hash would be calculated and proven, along with Bloom Filter extraction. Set of Poseidon Hashes and Bloom filters would be exposed through CTL in public inputs along with Parent and Block Hash to make both time-merging and aggregation with other proofs possible. We will send compressed proof along with open-text logs, hashes and blooms from this block to the next stage. This step encapsulates Keccak, Keccak Sponge, Logic and partially Data STARKs from the currently employed approach.

#### Bloom Filter

This step is needed to reduce the amount of computations as filtering with Bloom filters is much cheaper than proving non-inclusion. And here is the point where proof flow diverges for different clients. We can pre-calculate the footprint of Contract of Interest and Topics of Interest and prove total match with Bloom filter, extracted on previous step. Receipts that fail Bloom test would be just not included in new public inputs. Copy of proof from the previous step would be aggregated with this filtering. As bloom filters are no longer needed, only Poseidon hashes of receipts that pass Bloom test would be exposed through CTL in public inputs (Parent and Block hashes are also there). Open-text logs of selected receipts and hashes would be passed to the next stage alongside with the proof. Note that all rejected receipts are guaranteed not to contain events we are looking for. Some of the accepted ones (around 1%) still will not contain expected events, so the next step is necessary. In the current approach Bloom Filtering is not used.

#### Full Filter

On this step partial RLP decoding should be done on recipes in order to find locations of contract addresses and topics and then matched against plain representation of expected contract address and topic. As receipts here came from untrusted sources (passed along by computational network), we would need to prove their Poseidon hashes and verify against public CTL from the previous step. As with the previous step, only valid receipts hashes are included in the new CTL and proof is aggregated with previous one. Additionally, there is a set of hints to locate portions of events within logs to be parsed. This information also passed through CTL and Public Inputs. Clear-text representation of transactions, hashes and hints are still tagging along. At this point we have successfully filtered needed receipts and proven that no receipts are left behind. In the current approach Data STARK ensures that only receipts with correct metadata are processed.

#### Extraction

Here is the first explicitly user-defined step and another junktion. Client is expected to provide definition on how to extract data from receipt. Namely, we expect definition in a form of sequence of pairs (offset, size), each defining one extracted element. We would prove correct extraction from receipt, based on hint and provided definition, so each valid event in receipt would produce one tuple with data. Receipt data authenticity would be once again proven with Poseidon hash. Public Inputs of the proof from this stage would contain CTL with all extracted tuples from the block (and also Parent and Block hashes). As always through this pipeline, proof from the previous stage would be aggregated in and detailed output data attached. Clients can define multiple Extractions for different purposes from the same data. Currently, Data STARK is handling this task. Note that this is the last stage where hashing is needed to prove data integrity.

#### Map

Mapper is created around a user-defined MAP function, which takes a tuple from the previous step and transforms it into the new tuple. As the first milestone, only arithmetic operations and simple lookups are allowed. Additional operations and loops will be introduced later. Proof form would be exactly the same as in the previous step. Currently, this functionality is implemented with Arithmetics and Sum STARKs. For example, current mapping function is “volume = scale[token_id] * token_amount”.

#### Reduce

Reducer would create proof, aggregated through time with user-defined REDUCE function. This function would take the accumulator tuple and mapper tuple and create new accumulator tuple. All operations, available on previous step available here as well. Reducer is the only step in the pipeline that should be run sequentially in the general case. For this proof, Public Inputs would contain Parent Hash of the first block reducer executed at, current Block Hash, initial accumulator tuple and current accumulator tuple. Plain-text metadata would follow as well. Proof itself would be aggregated with mapper proof from the previous step and reducer proof from the previous block. Current reducer function is “new_volume = volume + old_volume” and it is handled by Arithmetics and Sum STARKs. For each new block reducer will yield new proof. There are a lot of customization on exact batching method for reducer, but it will be introduced later.

#### Final Wrap

In order to save gas while verifying proofs on chain proofs from previous state may be wrapped in order to decrease the number of public inputs, e.g. the starting block and initial tuple are always the same, they could be just hidden in the wrapper. If some fields in the accumulator tuple are only needed for intermediary calculations, they may be hidden as well. We currently use wrap for this exact purpose as well. As it is a point of customisation, the proof interface can be tailored here.

Last updated