Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Simplify regex_traits<_Elem>::translate(_Elem) #5209

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Dec 27, 2024

Simplifies regex_traits<_Elem>::translate(_Elem) to just return its only argument. In #5204 (comment), I voiced my suspicion that the current implementation essentially does just that in a very complicated and expensive way. I verified this now by running the following program that tested 902 locales available on my machine (no output to stdout produced by the program):

#include <iostream>
#include <locale>
#include <regex>
#include <string>
#include <vector>

#include <windows.h>

using namespace std;

BOOL add_locale(LPWSTR str, DWORD flags, LPARAM param) {
    vector<string>& vec = *reinterpret_cast<vector<string>*>(param);
    int size            = WideCharToMultiByte(CP_ACP, 0, str, -1, nullptr, 0, nullptr, nullptr);
    if (size) {
        vector<char> narrow(static_cast<size_t>(size), '\0');
        if (WideCharToMultiByte(CP_ACP, 0, str, -1, narrow.data(), size, nullptr, nullptr)) {
            vec.push_back(narrow.data());
        }
    }
    return TRUE;
}

int main() {
    vector<string> locales;
    locales.push_back("C");
    EnumSystemLocalesEx(&add_locale, 0, reinterpret_cast<LPARAM>(&locales), nullptr);

    for (const string& loc_name : locales) {
        const locale loc(loc_name);
        regex_traits<char> traits;
        regex_traits<wchar_t> wtraits;
        traits.imbue(loc);
        wtraits.imbue(loc);

        for (unsigned int i = 0; i <= 0xff; ++i) {
            try {
                const char x = (char) (unsigned char) i;
                if (traits.translate(x) != x) {
                    cout << loc_name << i << '\n';
                }
            } catch (const length_error&) {
            }
        }

        for (unsigned int i = 0; i <= 0xffff; ++i) {
            try {
                const wchar_t x = (wchar_t) i;
                if (wtraits.translate(x) != x) {
                    cout << loc_name << i << '\n';
                }
            } catch (const length_error&) {
            }
        }
    }
    return 0;
}

Even so, this PR still introduces a minor behavior change: The previous implementation can throw length_error("string too long") when it is passed a char that isn't a valid character in the locale's encoding (e.g., 0x80 in locales using UTF-8 encoding). But I think that the old behavior is undesirable anyway, as it makes the regex engine always fail with an exception in regex_constants::collate mode when a locale using UTF-8 encoding is imbued and the regex engine is applied to strings containing non-ASCII characters.

@muellerj2 muellerj2 requested a review from a team as a code owner December 27, 2024 15:07
Copy link
Contributor

@frederick-vs-ja frederick-vs-ja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, it seems that we took too much time to look into this.
[re.traits]/4 simply specifies that an implementation-provided regex_traits<C>::translate just returns the argument. See also https://en.cppreference.com/w/cpp/regex/regex_traits/translate.

@muellerj2
Copy link
Contributor Author

I put so much work in this mainly because I was worried about mix-and-match scenarios, not because I consider the old implementation correct. Think of the scenario that the implementations actually commonly produce different results and regex parser and matcher pick up different implementations or some other strange combination. Then this change had the potential to actually degrade the regex engine in such a mix-and-match scenario and lead to regex bugs that are difficult to understand and reproduce.

@StephanTLavavej StephanTLavavej added the enhancement Something can be improved label Jan 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 4, 2025
@StephanTLavavej StephanTLavavej added the regex Everyone's favorite header label Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Something can be improved regex Everyone's favorite header
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

3 participants