Why is the output of case folding different from what I expected?

use feature 'fc';
use utf8;
use open ':std', ':encoding(UTF-8)';

my $char = "İ";
my $folded = fc($char);

print "fc(İ): $folded\n";
print "Hex values: ", join(" ", map { sprintf "%02X", ord($_) } split //, $folded), "\n";

The code above was generated by ChatGPT, which said the output should be:

fc(İ): i
Hex values: 69

But both of my laptops (macOS with Perl v5.40.2, and Ubuntu with Perl v5.38.2) output:

fc(İ): i̇
Hex values: 69 307

I'm just three days into learning Perl and I'm curious whether any additional configuration is necessary.

I tried using Perl’s fc() function to case-fold the character İ. I expected it to return "i", but instead it returned "i̇" (an i with a combining dot above). I’m confused because I thought fc would behave like Unicode case folding. Is there something I need to configure?

Answer

You have

U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

Unicode 14.0.0 defines its fold as follows (in CaseFolding.txt):

0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

Since there are two entries, it means there are two ways to fold this code point.

One of them is longer than the original string; it is marked “F” for “full case folding”. This is the preferred folding where possible. Implementations that can’t handle a folding that grows the string fall back to the non-growing alternative, normally marked “S” for “simple case folding”.
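The same data can be inspected from Perl itself. This is a minimal sketch, assuming the core module Unicode::UCD and its casefold() function; it reports the folding data for the Unicode version bundled with the running perl, which is not necessarily 14.0.0.

```perl
use strict;
use warnings;
use Unicode::UCD qw(casefold);

# Look up the case-folding record for U+0130.
my $entry = casefold(0x130);

# Each field holds space-separated hex code points, or an empty string
# when the code point has no mapping of that kind.
for my $kind (qw(status full simple turkic)) {
    printf "%-6s = '%s'\n", $kind, $entry->{$kind} // '';
}
```

The "full" field corresponds to the “F” line, and the "turkic" field to the second line quoted above.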

But here we have “T” instead of “S” because this character is a special case. Of “T”, the standard says:

special case for uppercase I and dotted uppercase I

For non-Turkic languages, this mapping is normally not used.

For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters. Note that the Turkic mappings do not maintain canonical equivalence without additional processing. See the discussions of case mapping in the Unicode Standard for more information.
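The remark about canonical equivalence can be illustrated with a short sketch. It assumes the core module Unicode::Normalize; naive_turkic_fold is a hypothetical helper written for this example (it is not part of Perl) that applies the two “T” mappings code point by code point and leaves everything else to fc.

```perl
use strict;
use warnings;
use feature 'fc';
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{0130}";          # İ as a single code point
my $decomposed  = NFD($precomposed);   # U+0049 U+0307: I + COMBINING DOT ABOVE

# Full folding treats both spellings of the same letter alike.
my $full_agrees = fc($precomposed) eq fc($decomposed);

# Hypothetical helper: apply the "T" mappings (U+0049 -> U+0131,
# U+0130 -> U+0069) first, then fold the rest with fc.
sub naive_turkic_fold {
    my ($str) = @_;
    $str =~ tr/\x{0049}\x{0130}/\x{0131}\x{0069}/;
    return fc($str);
}
my $turkic_agrees =
    naive_turkic_fold($precomposed) eq naive_turkic_fold($decomposed);

print "full folding agrees:   ", ($full_agrees   ? "yes" : "no"), "\n";
print "Turkic folding agrees: ", ($turkic_agrees ? "yes" : "no"), "\n";
```

The full fold gives the same result for both spellings. The naive Turkic fold does not: the precomposed form becomes U+0069, while the decomposed form becomes U+0131 U+0307.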

Perl supports foldings that are larger than the original, and Perl's fc is not language aware, so it returns the following as mandated by the Unicode spec:

U+0069 LATIN SMALL LETTER I
U+0307 COMBINING DOT ABOVE
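To check this locally, the sketch below lists the folded code points by name and then uses fc the way it is intended to be used, by folding both sides of a comparison. It assumes only core Perl (the fc feature and the charnames module); the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use feature 'fc';
use charnames ();
use open ':std', ':encoding(UTF-8)';

# List the folded code points by number and name.
my $folded = fc("\x{0130}");
printf "U+%04X %s\n", ord($_), charnames::viacode(ord($_))
    for split //, $folded;

# fc is meant for caseless comparison: fold both sides, then compare.
my $dotted_match = fc("\x{0130}STANBUL") eq fc("i\x{0307}stanbul");
my $plain_match  = fc("\x{0130}STANBUL") eq fc("istanbul");

print "matches i + U+0307: ", ($dotted_match ? "yes" : "no"), "\n";
print "matches plain i:    ", ($plain_match  ? "yes" : "no"), "\n";
```

Because fc is not language aware, a string starting with a plain "i" does not compare equal here. Code that needs Turkic behaviour has to apply the “T” mappings itself before calling fc.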
