Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: System.Linq.Shuffle() #111221

Open
andanteyk opened this issue Jan 9, 2025 · 7 comments
Open

[API Proposal]: System.Linq.Shuffle() #111221

andanteyk opened this issue Jan 9, 2025 · 7 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq untriaged New issue has not been triaged by the area owner

Comments

@andanteyk
Copy link

Background and motivation

Shuffle is a universal operation, such as in games and preprocessing for machine learning.

LINQ includes APIs for changing the order, such as OrderBy() and Reverse().
However, there is no dedicated API for shuffling an IEnumerable<T> yet.
(There is System.Random.Shuffle() for Span<T>, but if you want to take an arbitrary IEnumerable<T> sequence, you'll probably need a custom implementation.)

On StackOverflow, we often see examples of shuffling implemented with code like .OrderBy(_ => Guid.NewGuid()).
However, this implementation is inefficient, and there are more efficient ways to shuffle.

Therefore, I propose to implement System.Linq.Shuffle().

API Proposal

namespace System.Linq;

public static class Enumerable 
{
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source) { }
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random randomSource) { }
}

API Usage

var songs = ["Lorem", "ipsum", "dolor", "sit", "amet"];
foreach(var song in songs.Shuffle())
{
    Play(song);
}

// pick 3 songs with no duplicates
var threeSongs = songs.Shuffle().Take(3).ToArray();

Alternative Designs

Possible considerations:

  • Make the return type a dedicated type, for example IOrderedEnumerable<T> for OrderBy().
  • An overload that takes a Random instance, which allows for reproducible shuffling, but requires a fixed implementation.

Besides, Shuffle() allows for an easy (though not optimal) solution to #102229.

var noDuplicationGetItems = array.Shuffle().Take(n);

Risks

No response

@andanteyk andanteyk added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jan 9, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jan 9, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

@LeaFrock
Copy link
Contributor

LeaFrock commented Jan 9, 2025

See #78419 (comment), #73864 (comment).

Shuffle depends highly on the random algorithm instead of IEnumerable<T>. At least it's easy to add an extension for yourself.

@andanteyk
Copy link
Author

Even if you say "you can easily implement it yourself," it is not efficient, and if someone (including me) has implemented it themselves many times, I think that is a reason to include it in the official API.
OrderBy() is not the optimal implementation, nor is it the optimal use. It is for "sorting".

I don't think it's a problem even if Random.Shuffle() is used internally.
(Strictly speaking, an inside-out implementation would be more efficient than the method of "copying to an internal array and then performing an inplace shuffle".)
Even though Random.Shuffle() exists, additional implementation is currently required to support IEnumerable<T>. (It will probably be necessary to make it T[] first.)
Therefore, I think there is a reason for its existence.

If "Random, not the collection, should be responsible", would that be Random.Shared.Shuffle(IEnumerable<T> source)?
Still, I personally prefer the LINQ style.

@LeaFrock
Copy link
Contributor

LeaFrock commented Jan 9, 2025

Random, not the collection, should be responsible

Yes, the key point is how to implement the shuffle effect with randomness.

additional implementation is currently required to support IEnumerable. (It will probably be necessary to make it T[] first.)

The IEnumerable is not the concept of 'collection' (without Count, which ICollection has; without indexer, which IList has). How to shuffle a data set by Random without an expectable count and an accessible indexer? The problem is not so easy as you think.

it is not efficient

I agree but the similar problems aren't always the responsibility of runtime.

As I mentioned above, the upper level developers/libraries can provides some encapsulation practice based on BCL.

You can add the following extension method, which I think covers 99% usage scenes.

        public static void Shuffle<T>(this IList<T> source)
        {
            if (source is T[] array)
            {
                Random.Shared.Shuffle(array);
                return;
            }
            if (source is List<T> list)
            {
                Random.Shared.Shuffle(CollectionsMarshal.AsSpan(list));
                return;
            }
            ShuffleSlow(source);

            static void ShuffleSlow(IList<T> values)
            {
                int n = values.Count;
                var rand = Random.Shared;
                for (int i = 0; i < n - 1; i++)
                {
                    int j = rand.Next(i, n);

                    if (j != i)
                    {
                        (values[j], values[i]) = (values[i], values[j]);
                    }
                }
            }
        }

Even so, you should be realized:

  1. the Shuffle modify the source immediately, which LINQ APIs doesn't do so.
  2. CollectionsMarshal.AsSpan(list) has some risks if other thread is updating the list.

So you can see, the answer depends on your use cases. You have to change the implement as you require.

@eiriktsarpalis
Copy link
Member

How could this be implemented without bias in cases where the length of the enumerable isn't known ahead of time or could potentially be infinite?

@eiriktsarpalis
Copy link
Member

To answer my own question, this is typically addressed using reservoir sampling. This necessarily introduces bias in the generated permutations, however it might be a good-enough compromise for some use cases.

One potential implementation could involve implementing the classical shuffling algorithm for sources implementing IList<T>/IReadOnlyList<T> and then falling back to reservoir sampling for everything else. The problem with such an approach is that it violates LSP, two enumerables with identical elements would be shuffled using different algorithms with different statistical properties depending on what other interfaces they happen to implement.

@andanteyk
Copy link
Author

As far as I understand, reservoir sampling is an algorithm that extracts elements partially, so it may be difficult to apply it to the current application, which requires shuffling the entire array.
In addition, the result cannot be determined until all elements are enumerated (at least once), so I don't think it can handle infinitely long IEnumerable<T>.
(I'm sorry if I'm wrong, please point it out.)

In my opinion, it would be implemented like Reverse() does now, by first materializing the array with ToArray(), then shuffling it, and returning it in order.
OrderBy() seems to have a separate array of the original data obtained with ToArray() and a sorted index.

Of course, there may be better ideas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Linq untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

3 participants