
[tailsamplingprocessor] Support external decision cache implementations #37035

Open · wants to merge 4 commits into main
Conversation

Logiraptor

Description

Adding a feature. This PR adds support for external implementations of the decision cache. This allows the collector (or another service using the processor) to supply an alternative decision cache based on alternative algorithms or external services like memcached without needing to explicitly add support for all possible options in the processor.

It reuses the existing function-option pattern and only exposes two options for now: WithSampledDecisionCache and WithNonSampledDecisionCache. I've avoided exporting other options to avoid bloating the external interface without a concrete use case. The majority of the changes are cleanup from the refactoring to move Option values into the Config struct instead of a variadic parameter on newTracesProcessor.

@Logiraptor Logiraptor requested review from jpkrohling and a team as code owners January 6, 2025 19:16
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jan 6, 2025
@portertech
Contributor

@Logiraptor please add a changelog entry to .chloggen/.

@jpkrohling
Member

I'd prefer another approach to this, closer to what we have with the auth extensions, to where we plan to go with policies (#31582), and to how we plan to support middlewares: via extensions.

Concretely, the config would look like this:

extensions:
  cacheimpl:
    url: ...
    max: 1_000

processors:
  tailsampling:
    cache: cacheimpl

To accomplish that, we'd need:

  • the "cache extension" interface at extension/cache (perhaps even at core instead of contrib)
  • one cache implementation (perhaps the current one we have can be copied over there)
  • change the processor to load the proper extension. Here's how we do it with auth:

https://github.com/open-telemetry/opentelemetry-collector/blob/ced38e8af2ae586363e3b2a34990e906a1227ccb/config/confighttp/confighttp.go#L199-L204

https://github.com/open-telemetry/opentelemetry-collector/blob/ced38e8af2ae586363e3b2a34990e906a1227ccb/config/configauth/configauth.go#L51-L58

@Logiraptor
Author

@jpkrohling Thanks for the feedback! I'm looking into that now, and I want to make sure I understand since it's my first time contributing to this repo. 🙏

So right now we have this interface:

// Cache is a cache using a pcommon.TraceID as the key and any generic type as the value.
type Cache[V any] interface {
	// Get returns the value for the given id, and a boolean to indicate whether the key was found.
	// If the key is not present, the zero value is returned.
	Get(id pcommon.TraceID) (V, bool)
	// Put sets the value for a given id
	Put(id pcommon.TraceID, v V)
	// Delete deletes the value for the given id
	Delete(id pcommon.TraceID)
}

When I create an extension, are you envisioning the same interface, or something more generic? I believe the key was intentionally not made generic, to allow for optimizations in the implementation (such as using the right side of the ID as a key).

If the cache should be more generic, then what about something that takes string keys and []byte values?

If the cache should not be more generic, then I'm wondering how useful it would be as an extension, since it presumably wouldn't be used by other components. I could be wrong; like I said, this is my first contribution here, so I'm still learning.

What do you think?

@Logiraptor
Author

I'll also note that even though the cache is generic on the value type, it's only ever used as Cache[bool].

@jpkrohling
Member

> If the cache should not be more generic, then I'm wondering how useful it would be as an extension, since it presumably wouldn't be used by other components.

I think this is the case, and the name should therefore be something like TraceDecisionCache. We can reuse it as Cache[string] in the load-balancing exporter to always route traces to the same backend even after a ring change.

@Logiraptor
Author

Logiraptor commented Jan 8, 2025

@jpkrohling Got it. Working through this, I think there's another issue: I don't see how I can maintain type safety through the extension system while also supporting different value types. Specifically, the CreateFunc passed to extension.NewFactory must return a single concrete type, and since it has no access to the downstream user of the extension, it has no way of knowing which type to return (i.e., Cache[bool] or Cache[string]).

Similarly, when the tail sampling processor tries to get the extension, it has to type-assert the value to some concrete type. So these two pieces of code must implicitly agree on a single concrete type, and in that case there's no benefit to using a type parameter. I think the only real solutions are either to forgo type safety by using any, or to use a type that is more generally useful, like []byte. In the latter case, it's up to each component using the cache to define a codec for converting to/from []byte. This seems to be what the storage extensions do, since they have signatures like:

	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error

So now I'm thinking this cache extension should be a very similar interface with different semantics: a cache does not guarantee durability between Get and Set, whereas storage does.

Thoughts?

@Logiraptor
Author

Logiraptor commented Jan 8, 2025

One way I've found that I think could work is something like this:

// CacheExtension is an extension that caches sampling decisions.
type CacheExtension interface {
	extension.Extension

	Cache[[]byte]
}

// Cache is a cache using a pcommon.TraceID as the key and any generic type as the value.
type Cache[V any] interface {
	// Get returns the value for the given id, and a boolean to indicate whether the key was found.
	// If the key is not present, the zero value is returned.
	Get(id pcommon.TraceID) (V, bool)
	// Put sets the value for a given id
	Put(id pcommon.TraceID, v V)
	// Delete deletes the value for the given id
	Delete(id pcommon.TraceID)
}

type Codec[V any] interface {
	// Encode encodes the value into a byte slice.
	Encode(v V) ([]byte, error)
	// Decode decodes the byte slice into a value.
	Decode(data []byte) (V, error)
}

func NewTypedCache[V any](codec Codec[V], cache Cache[[]byte]) Cache[V]

Then components interested in using the cache would get a []byte cache from the system and safely convert it (with a codec) into a typed cache.

@Logiraptor
Author

Once that codec idea is in place, it would be trivial to add an "easy button" codec, such as a JSON codec, which works for most types via reflection.
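Such an "easy button" codec is a few lines on top of encoding/json, which handles arbitrary marshalable types via reflection. A sketch (the Codec interface is repeated here for self-containment; jsonCodec is an illustrative name, not code from the branch):

```go
package main

import "encoding/json"

// Codec repeats the interface from the sketch above.
type Codec[V any] interface {
	Encode(v V) ([]byte, error)
	Decode(data []byte) (V, error)
}

// jsonCodec encodes and decodes any JSON-marshalable V using encoding/json.
type jsonCodec[V any] struct{}

func (jsonCodec[V]) Encode(v V) ([]byte, error) { return json.Marshal(v) }

func (jsonCodec[V]) Decode(data []byte) (V, error) {
	var v V
	err := json.Unmarshal(data, &v)
	return v, err
}
```

For Cache[bool] specifically, a hand-rolled one-byte codec would be far more compact than JSON, but the JSON version works for any value type without extra code.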

@Logiraptor
Author

@jpkrohling I went ahead and implemented that last idea to see how it looks: main...Logiraptor:opentelemetry-collector-contrib:logiraptor/decision-cache-extension

IMO it's OK, but it's unfortunate that every use of the extension has to pay an encoding cost even when only caching objects in memory. There will be a significant memory-usage difference between the original implementation, which is generic, and this one, which has to use []byte as the common denominator. Alternatively, if the extension only works as a cache from trace ID -> bool, then we wouldn't be able to reuse it in the load-balancing exporter as you described.

What do you think I should do here?
