
Watcher Memory Improvements#

In 0.92.0, the watcher dropped its internal buffering of state and now fully delegates any potential buffering to the associated Store.

This can yield a decent reduction in memory use for direct users of watcher, but also (somewhat unintuitively) for users of reflectors and stores.

In this post, we explore the setup, current solutions, and some future work. It has been updated in light of 0.92.1.

Runtime Memory Performance#

The memory profile of any application using kube::runtime is often dominated by buffers of the Kubernetes objects that need to be watched. The main offender is the reflector, with a literal type Cache<K> = Arc<RwLock<AHashMap<ObjectRef<K>, Arc<K>>>> hiding internally as the lookup used by Stores and Controllers.

We have lots of advice on how to reduce the size of this cache. The optimization guide shows how to:

  • minimize what you watch :: by constraining watch parameters with selectors
  • minimize what you ask for :: use metadata_watcher on watches that do not need the .spec
  • minimize what you store :: by dropping fields before sending to stores

These are quick and easy steps that improve the memory profile and are worth checking out (the benefits of doing these will further increase in 0.92).
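As a rough sketch, these three steps might be combined along the following lines. The Pod type, label selector, and pruning choices here are purely illustrative, and this assumes the metadata_watcher, WatchStreamExt::modify, and reflector helpers from recent kube releases:

    use futures::StreamExt;
    use k8s_openapi::api::core::v1::Pod;
    use kube::{
        runtime::{metadata_watcher, reflector, watcher, WatchStreamExt},
        Api, Client, ResourceExt,
    };

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = Client::try_default().await?;
        let pods: Api<Pod> = Api::default_namespaced(client);

        // minimize what you watch: constrain the watch with a label selector
        let cfg = watcher::Config::default().labels("app=my-app");

        // minimize what you ask for: metadata_watcher skips .spec and .status entirely
        // (`_reader` would be handed to whatever needs cached lookups)
        let (_reader, writer) = reflector::store();
        let stream = metadata_watcher(pods, cfg)
            .default_backoff()
            // minimize what you store: prune heavy fields before they reach the store
            .modify(|meta| meta.managed_fields_mut().clear())
            .reflect(writer)
            .applied_objects();

        futures::pin_mut!(stream);
        while let Some(meta) = stream.next().await {
            println!("saw {}", meta?.name_any());
        }
        Ok(())
    }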

Improving what is stored in the Cache above is important, but it is not the full picture...

The Watch API#

The Kubernetes watch API is an interesting beast. You have no guarantee you'll get every event, and you must be able to restart from a potentially new checkpoint without being told what changed in the downtime. This is mentioned briefly in Kubernetes API concepts as an implication of its 410 Gone responses.

When 410 Gone responses happen we need to trigger a re-list, and wait for all data to come through before we are back in a "live watching" mode that is caught up with reality. This type of API consumption is problematic when you need to do work with reflectors/caches where you are generally storing complete snapshots in memory for a worker task. Controllers are effectively forced to treat every event as a potential change, and chase reconciler#idempotency as a work-around for not having guaranteed delivery.

Let's focus on caches. To simplify these problems for users we have created certain guarantees in the abstractions of kube::runtime.

Runtime Guarantees#

Up until 0.92.0, the watcher maintained a guarantee we have casually referred to as watcher atomicity:

watcher atomicity < 0.92.0

You only see a Restarted on re-lists once every object has been received through an api.list.
Watcher events will pause between a de-sync / restart and a Restarted. See watcher::Event@0.91.

This property meant that stores could in turn provide their own guarantee very easily:

Store completeness

Store always presents the full state once initialised. During a relist, previous state is presented.
There is no down-time for a store during relists, and its Cache is replaced atomically in a single locked step.

This property is needed by Controllers, which rely on complete information, and it kicks in once the future from Store::wait_until_ready resolves.
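For illustration, a minimal sketch (ConfigMap is an arbitrary choice here) of waiting for that readiness signal before reading from a reflector's store:

    use futures::StreamExt;
    use k8s_openapi::api::core::v1::ConfigMap;
    use kube::{
        runtime::{reflector, watcher, WatchStreamExt},
        Api, Client,
    };

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = Client::try_default().await?;
        let cms: Api<ConfigMap> = Api::default_namespaced(client);

        let (reader, writer) = reflector::store::<ConfigMap>();
        // drive the reflector in the background so the writer keeps the store updated
        let stream = watcher(cms, watcher::Config::default())
            .default_backoff()
            .reflect(writer)
            .applied_objects()
            .for_each(|_| futures::future::ready(()));
        tokio::spawn(stream);

        // resolves once the first complete list has been applied; from then on the
        // store always presents full state, even across later relists
        reader.wait_until_ready().await?;
        println!("store ready with {} objects", reader.state().len());
        Ok(())
    }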

Store Consequences#

If we do all the buffering on the watcher side, then achieving the store completeness guarantee is trivial.

Up until 0.91 this was handled in Store::apply_watcher_event@0.91 with a *self.store.write() = new_objs on the old Restarted event:

// 0.91 source:
        match event {
            watcher::Event::Applied(obj) => {
                let key = obj.to_object_ref(self.dyntype.clone());
                let obj = Arc::new(obj.clone());
                self.store.write().insert(key, obj);
            }
            watcher::Event::Deleted(obj) => {
                let key = obj.to_object_ref(self.dyntype.clone());
                self.store.write().remove(&key);
            }
            watcher::Event::Restarted(new_objs) => {
                let new_objs = new_objs
                    .iter()
                    .map(|obj| (obj.to_object_ref(self.dyntype.clone()), Arc::new(obj.clone())))
                    .collect::<AHashMap<_, _>>();
                *self.store.write() = new_objs;
            }
        }

Thus, on a relist/restart:

  1. watcher pages were buffered internally
  2. the Restarted arm was entered, where each object got cloned while creating new_objs
  3. the store (still containing the complete old data) was swapped at the very end

so you had a moment of potentially 3x peak memory use (2x should have been the max).

On top of that, the buffer in the watcher was not always released (quote from discord):

The default system allocator never returns the memory to the OS after the burst, even if the objects are dropped. Since the initial list fetch happens sporadically you get a higher RSS usage together with the memory spike. Solving the burst will solve this problem, and reflectors and watchers can be started in parallel without worrying of OOM killers.
The allocator does not return the memory to the OS since it treats it as a cache. This is mitigated by using jemalloc with some tuning, however, you still get the memory burst so our solution was to use jemalloc + start the watchers sequentially. As you can imagine it's not ideal.

So in the end you might actually be holding on to between 2x and 3x the actual store size at all times.

watcher guarantee was designed for the store guarantee

If you were using watcher without reflector, you were the most affected by this excessive caching. You might not have needed watcher atomicity, as it was primarily designed to facilitate store completeness.

Watcher Consequences#

If you were just watching data on 0.91.0 (not using stores), the buffering was completely unnecessary if you just wanted to react to events about individual objects without considering the wider dataset.

Your peak memory use for a single watcher (with all other things considered negligible) was going to scale with the size of the complete dataset, because the watcher buffered ALL pages, whereas it really should only scale with the page size you request objects in.
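For comparison, a store-free consumer might look like this sketch (Pod again being an arbitrary choice); after 0.92.0 its memory should scale with the page size rather than the dataset size:

    use futures::TryStreamExt;
    use k8s_openapi::api::core::v1::Pod;
    use kube::{
        runtime::{watcher, WatchStreamExt},
        Api, Client, ResourceExt,
    };

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let client = Client::try_default().await?;
        let pods: Api<Pod> = Api::default_namespaced(client);

        // no reflector/store: just react to each object as it arrives
        let stream = watcher(pods, watcher::Config::default())
            .default_backoff()
            .applied_objects();
        futures::pin_mut!(stream);
        while let Some(pod) = stream.try_next().await? {
            println!("event for {}", pod.name_any());
        }
        Ok(())
    }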

Change in 0.92#

The change in 0.92.0 is primarily to stop buffering events in the watcher, and to present new watcher events that allow a store to achieve the Store completeness guarantee.

As it stands, Store::apply_watcher_event@0.92 is now slightly smarter and achieves the same guarantee:

// 0.92 source
        match event {
            watcher::Event::Apply(obj) => {
                let key = obj.to_object_ref(self.dyntype.clone());
                let obj = Arc::new(obj.clone());
                self.store.write().insert(key, obj);
            }
            watcher::Event::Delete(obj) => {
                let key = obj.to_object_ref(self.dyntype.clone());
                self.store.write().remove(&key);
            }
            watcher::Event::Init => {
                self.buffer = AHashMap::new();
            }
            watcher::Event::InitApply(obj) => {
                let key = obj.to_object_ref(self.dyntype.clone());
                let obj = Arc::new(obj.clone());
                self.buffer.insert(key, obj);
            }
            watcher::Event::InitDone => {
                let mut store = self.store.write();
                std::mem::swap(&mut *store, &mut self.buffer);
                self.buffer = AHashMap::new();
                // ...
            }
        }

Thus, on a restart, objects are passed one-by-one up to the store, and buffered therein. When all objects are received, the buffers are swapped (meaning you use at most 2x the data). The blank buffer re-assignment also forces de-allocation* of the temporary self.buffer.

Preparing for StreamingLists

Note that the new partial InitApply events only pass up individual objects, not pages. This is to prepare for the 1.27 Alpha StreamingLists feature, which also passes individual objects. Once this becomes available for even our minimum kubernetes-version we can make it the default - reducing page buffers further - exposing the literal api results rather than pages (of 500 objects by default). In the meantime, we send pages through item-by-item to avoid a breaking change in the future (and also to avoid exposing the confusing concept of flattened/unflattened streams).
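If you want to experiment ahead of time on a cluster with the WatchList feature gate enabled, something like the following should opt a watcher in (a sketch, assuming the streaming_lists helper on watcher::Config):

    use kube::runtime::watcher;

    fn main() {
        // switch the initial list strategy from the default ListWatch to StreamingList
        // (alpha; requires the WatchList feature gate on the apiserver)
        let cfg = watcher::Config::default().streaming_lists();
        let _ = cfg; // pass to watcher()/metadata_watcher() as usual
    }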

Results#

The initial synthetic benchmarks saw 60% reductions when using stores, and 80% when not using stores (when there's nothing to cache), with further incremental improvements when using the StreamingList strategy.

Ad-hoc Benchmarks

The ad-hoc synthetic benchmarks are likely unrealistic for real world scenarios. The original 0.92.0 release had a bug affecting benchmarks, so many of the linked posts may be invalid / out-of-date. How much you can get out of this will depend on a range of factors from allocator choice to usage patterns.

So far, we have seen controllers with a basically unchanged profile, some with small improvements in the 10-20% range, and one 50% drop in a real-world controller from testing (EDIT: which is still sustained after the 0.92.1 bugfix, with the page size set).

In the current default ListWatch InitialListStrategy, the implicit default Config::page_size of 500 will undermine this optimization somewhat, because individual pages are still kept in the watcher while they are being sent out one-by-one. Setting the page size to 50 was necessary for me to get anything close to the benchmarks.
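That tweak is a one-liner on the watcher config (a sketch, assuming the page_size builder on watcher::Config):

    use kube::runtime::watcher;

    fn main() {
        // shrink list pages from the implicit default of 500 objects down to 50
        // so the watcher's temporary per-page buffer stays small
        let cfg = watcher::Config::default().page_size(50);
        let _ = cfg; // pass to watcher()/metadata_watcher() as usual
    }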

Page Size Marginal Gains

Lowering the page size below 50 did see further marginal gains (~5%ish from limited testing), but this will also increase API calls (/list next page). It will be interesting to see how well the streaming lists change will fare in the end (as it effectively functions as setting the page size to 1 as far as internal buffering is concerned).

So for now: YMMV. Try setting the page_size, and chat about / share your results!

Examples#

Two example results from my own deployment testing (checking memory use with standard kubelet metrics 5m after initialisation) showed uneven gains.

Update after 0.92.1

This post has been edited in light of 0.92.1, as the original numbers cast the 0.92.0 release in an overly favourable light; 0.92.0 dropped pages, and this impacted the measurements.

Optimized Metadata Controller#

A metadata controller watching 2000 objects (all in stores), doing 6000 reconciles an hour.

45MB memory on 0.91.0, ~20MB on 0.92.1.

This saw the biggest improvement, dropping ~50% of its memory usage. It is also a tiny controller with basically no other cached data, and it is doing all the biggest optimization tricks (metadata_watcher, page_size 50, pruning of managed fields), so the page buffering actually constituted the majority of the memory use (a perhaps uncommon situation).

KS Controller#

A controller for flux kustomizations storing and reconciling about 200 ks objects without any significant optimization techniques.

~65MB memory on 0.91.0, ~65MB on 0.92.1.

No improvements overall on this one despite setting page size down to 50.

Thoughts for the future#

The peak 2x overhead here does hint at a potential future optimization: allowing users to opt out of the store completeness guarantee.

Store Tradeoffs

It is possible to build custom stores that avoid buffering objects on restarts by dropping the store completeness guarantee. This is not yet practical for Controller uses, due to requirements on Store types, but perhaps it could be made generic/opt-out in the future. It could flatten the peak usage.

As a step in the right direction, we would first like to get better visibility of our memory profile with some automated benchmarking. See kube#1505 for details.

It would also be good to better understand the choices of allocators here and their implications for some of these designs.

Breaking Change#

Users who are not matching on watcher::Event or building custom stores should never need to interact with this and should get the memory improvements for free.

If you are using a custom store please see the new watcher::Event and make the following changes in match arms:

  • Applied -> Apply
  • Deleted -> Delete
  • Restarted -> replaced by three new arms: Init, InitApply, and InitDone:
    • handle Init as the start marker (allocate a temporary buffer)
    • buffer objects from InitApply (you get one object at a time, no need to loop)
    • swap the store in InitDone and deallocate the old buffer

See the above Store::apply_watcher_event code for pointers.
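As an illustration, a hypothetical hand-rolled store (keyed by object name for simplicity, using Pod as a stand-in type) migrated to the new events might look like this:

    use std::collections::HashMap;

    use k8s_openapi::api::core::v1::Pod;
    use kube::{runtime::watcher::Event, ResourceExt};

    // a toy store keyed by name; a real store would key on ObjectRef and hold Arc<K>
    #[derive(Default)]
    struct NameStore {
        store: HashMap<String, Pod>,
        buffer: HashMap<String, Pod>,
    }

    impl NameStore {
        fn apply_watcher_event(&mut self, event: Event<Pod>) {
            match event {
                // Applied -> Apply
                Event::Apply(obj) => {
                    self.store.insert(obj.name_any(), obj);
                }
                // Deleted -> Delete
                Event::Delete(obj) => {
                    self.store.remove(&obj.name_any());
                }
                // Restarted -> Init + InitApply + InitDone
                Event::Init => {
                    // start of a relist: allocate a fresh temporary buffer
                    self.buffer = HashMap::new();
                }
                Event::InitApply(obj) => {
                    // one object at a time, no pages to loop over
                    self.buffer.insert(obj.name_any(), obj);
                }
                Event::InitDone => {
                    // swap in the complete new state and drop the old buffer
                    std::mem::swap(&mut self.store, &mut self.buffer);
                    self.buffer = HashMap::new();
                }
            }
        }
    }

The same shape works for any keying scheme; the important part is that the store, not the watcher, now owns the temporary buffer.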

Previous Improvements#

Memory optimization is a continuing saga and while the numbers herein are considerable, they build upon previous work:

  1. Metadata API support in 0.79.0
  2. Ability to pass minified streams into Controller in 0.81.0 documented in streams
  3. Controller::owns relation moved to lighter metadata watches in 0.84.0
  4. Default pagination of watchers in 0.84.0 via #1249
  5. initial streaming list support in 0.86.0
  6. Remove buffering in watcher in 0.92.0 - today 🎉

Thanks to everyone who contributes to kube!