
Storage IO

Capabilities

- async
- cachable

Some readers are implemented as serial-only; others are implemented with async support.

Async readers use a read buffer to throttle reads, so that complex processing which cannot consume reads fast enough does not cause memory limits to be exceeded.
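A minimal sketch of this throttling idea, assuming a bounded asyncio.Queue as the read buffer (the `fetch_blob` and `process` functions below are placeholders, not the actual implementation): the reader blocks on `put()` once the buffer is full, so slow downstream processing naturally limits how much data is held in memory.

```python
import asyncio

BUFFER_SLOTS = 8  # assumed limit; in practice sized to the memory budget


async def fetch_blob(name: str) -> bytes:
    # stand-in for a slow remote read (e.g. from a remote filestore)
    await asyncio.sleep(0.01)
    return name.encode()


def process(payload: bytes) -> None:
    # stand-in for downstream processing that may be slower than the reads
    pass


async def reader(blob_names, buffer: asyncio.Queue) -> None:
    for name in blob_names:
        payload = await fetch_blob(name)
        # put() blocks while the buffer is full - this is the throttle that
        # stops fast readers from blowing the memory limit
        await buffer.put(payload)
    await buffer.put(None)  # sentinel: no more data


async def processor(buffer: asyncio.Queue) -> None:
    while True:
        payload = await buffer.get()
        if payload is None:
            break
        process(payload)


async def main(blob_names) -> None:
    buffer: asyncio.Queue = asyncio.Queue(maxsize=BUFFER_SLOTS)
    await asyncio.gather(reader(blob_names, buffer), processor(buffer))


if __name__ == "__main__":
    asyncio.run(main([f"blob-{i}" for i in range(32)]))
```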

Caching

Up to two layers of cache:

- Buffer Pool
- Remote Cache

Buffer Pool

The Buffer Pool is one of the few process-scoped resources (i.e. it is shared across connections and queries). It is an in-memory cache of files and blobs, implemented on top of a memory pool.
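As an illustrative sketch only (the class and names below are assumptions, not the actual API), a buffer pool can be modelled as an evicting map from blob or file names to cached bytes, bounded by a byte budget. The real implementation sits on top of a memory pool rather than holding bytes objects directly.

```python
from collections import OrderedDict
from typing import Optional


class BufferPool:
    """Process-scoped, in-memory cache of blobs with LRU eviction (sketch)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        payload = self._items.get(key)
        if payload is not None:
            self._items.move_to_end(key)  # mark as recently used
        return payload

    def set(self, key: str, payload: bytes) -> None:
        if key in self._items:
            self.used -= len(self._items.pop(key))
        # evict least-recently-used entries until the new payload fits
        while self._items and self.used + len(payload) > self.capacity:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
        if len(payload) <= self.capacity:
            self._items[key] = payload
            self.used += len(payload)
```

Because it is process-scoped, a single instance would be created per process and shared by all connections and queries.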

Buffering

Note that the Buffer Pool is covered in the Caching section above.

Buffering is used to enable prefetching of files or blobs. This is most useful with remote filestores (e.g. GCS), which are generally quite slow to retrieve data from. Buffering allows us to use multiple workers to fetch data and populate buffers, as a strategy to ensure data is fed into the system at a rate similar to the rate at which it is processed.

We use a memory pool here to limit the amount of data waiting and to prevent the readers from exhausting system resources. The memory pool also removes the need to transfer large quantities of data between threads: rather than passing the data itself, we pass a reference into the pool.
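A minimal sketch of this reference-passing idea, with assumed names rather than the actual API: worker threads commit fetched bytes into a shared memory pool and push only a small integer reference onto the buffer, and the consumer reads and releases by reference, so large payloads are never handed between threads. The pool's capacity check is also what limits how much data can be waiting.

```python
import itertools
import threading
from queue import Queue
from typing import Dict


class MemoryPool:
    """Toy memory pool: stores payloads against integer references and
    enforces a total-size limit so readers cannot exhaust memory.
    Assumes each payload fits within the capacity."""

    def __init__(self, capacity_bytes: int) -> None:
        self._capacity = capacity_bytes
        self._used = 0
        self._segments: Dict[int, bytes] = {}
        self._ids = itertools.count()
        self._space = threading.Condition()

    def commit(self, payload: bytes) -> int:
        with self._space:
            # block the reader until the pool has room - limits data waiting
            while self._used + len(payload) > self._capacity:
                self._space.wait()
            ref = next(self._ids)
            self._segments[ref] = payload
            self._used += len(payload)
            return ref

    def read(self, ref: int) -> bytes:
        return self._segments[ref]

    def release(self, ref: int) -> None:
        with self._space:
            self._used -= len(self._segments.pop(ref))
            self._space.notify_all()


def reader_worker(pool: MemoryPool, blobs, refs: Queue) -> None:
    for blob in blobs:
        payload = blob.encode()  # stand-in for fetching from remote storage
        refs.put(pool.commit(payload))  # only the reference crosses threads
    refs.put(None)  # sentinel: no more data


def consumer(pool: MemoryPool, refs: Queue) -> None:
    while (ref := refs.get()) is not None:
        payload = pool.read(ref)  # look the data up by reference
        # ... process payload ...
        pool.release(ref)  # free the pool space for the readers
```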

Monitoring

stalls: the number of times a process has had to wait before it could continue.

write to buffer stalls indicates that the workers reading from storage are having to wait before they can add their payloads to the read buffers.

read from buffer stalls indicates that the processing of the data is having to wait for more data to be added to the buffers.
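As a hedged sketch of how such counters could be maintained (assumed names, not the actual metrics code): wrap the buffer operations, attempt the non-blocking form first, and count a stall each time a put or get would have had to wait.

```python
import queue
from collections import Counter


class MonitoredBuffer:
    """Bounded buffer that counts stalls on both sides (sketch).

    A stall is recorded each time a storage worker (writer) or the
    processor (reader) has to wait before it can continue."""

    def __init__(self, maxsize: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self.stats = Counter()

    def put(self, item) -> None:
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            self.stats["write to buffer stalls"] += 1
            self._queue.put(item)  # now wait for space

    def get(self):
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            self.stats["read from buffer stalls"] += 1
            return self._queue.get()  # now wait for data
```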

Terminology

stall: when there is still work to be done but a process is waiting on another part of the system.