Discussion:
cygwin_conv_path POSIX->WIN_A conversion
Corinna Vinschen
2011-11-11 14:46:46 UTC
Hi,

today it occurred to me that there might be a bug in cygwin_conv_path.
Before I change something for the worse, I thought I should discuss this
first. Here's the problem:

The conversion CCP_POSIX_TO_WIN_A, from POSIX to multibyte native
Windows path, uses the internal sys_wcstombs function to convert the
WCHAR path to multibyte. That means, the destination charset is the
Cygwin charset. So, if Cygwin is operating in default UTF-8, the
resulting Windows path will be in UTF-8 as well.

In most languages, the Windows ANSI charset is NOT UTF-8. So, if the
path gets converted to UTF-8, using the resulting path in a Win32 ANSI
function (CreateFileA or whatever) will fail.
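
To make the mismatch concrete, here is a minimal sketch in portable C (no
Cygwin or Win32 calls). It assumes, purely for illustration, that the ANSI
codepage is CP1252; the helper names encode_utf8 and encode_cp1252 are made
up for this example. For the character U+00E4 (a with diaeresis), UTF-8
produces the two bytes C3 A4, while CP1252 uses the single byte E4, so an
ANSI function reading the UTF-8 bytes as CP1252 would see two unrelated
characters instead of one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point below U+0800 as UTF-8.
   Returns the number of bytes written to out (1 or 2). */
size_t encode_utf8(unsigned cp, unsigned char *out)
{
    assert(out != NULL && cp < 0x800);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* trail byte 10xxxxxx */
    return 2;
}

/* Hypothetical helper: encode one code point as CP1252 (the ANSI codepage
   assumed for this example). Only ASCII and U+00A0..U+00FF are handled,
   where the CP1252 byte equals the code point; returns 0 otherwise. */
size_t encode_cp1252(unsigned cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    return 0;
}
```

The two encoders agree only for ASCII, which is why the problem goes
unnoticed with plain English file names.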

So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.

I think this would be more correct than converting to the current Cygwin
multibyte charset. The downside is that this *might* break backward
compatibility. However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?


Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat
Eric Blake
2011-11-11 16:27:58 UTC
Post by Corinna Vinschen
So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.
Yes, that sounds right to me,
Post by Corinna Vinschen
I think this would be more correct than converting to the current Cygwin
multibyte charset. The downside is that this *might* break backward
compatibility. However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
--
Eric Blake eblake-H+wXaHxf7aLQT0dZR+***@public.gmane.org +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Daniel Colascione
2011-11-11 17:02:36 UTC
Post by Eric Blake
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
There's really little reason to use WIN_A anyway. The world is Unicode these days.
Corinna Vinschen
2011-11-14 10:33:05 UTC
Post by Eric Blake
Post by Corinna Vinschen
So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.
Yes, that sounds right to me,
Post by Corinna Vinschen
I think this would be more correct than converting to the current Cygwin
multibyte charset. The downside is that this *might* break backward
compatibility. However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
I'm just worried that this would open a can of worms.

If CCP_POSIX_TO_WIN_A always converts to ANSI/OEM, shouldn't
CCP_WIN_A_TO_POSIX always convert from ANSI/OEM? However, if the DOS
path has been entered on the Cygwin command line, it will very likely
not be given in the current ANSI/OEM CP, but rather in the Cygwin
charset.

Having said that, I'm wondering if we shouldn't leave the current
conversion alone and rather add new flags to cygwin_conv_path, so that
the *caller* can specify whether the conversion should be done using the
Cygwin or the Windows multibyte charset, or always UTF-8. Something
along these lines:

CCP_CYGWIN_CODESET = 0, <-- Do you have a better idea?
CCP_WIN32_ANSI_CP = 0x10,
CCP_WIN32_OEM_CP = 0x20,
CCP_UTF8_CODESET = 0x30,
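
A sketch of how such flags might be decoded, assuming they would be OR'ed
into the existing flag argument of cygwin_conv_path. Only the four codeset
values come from the proposal above. CCP_CODESET_MASK, ccp_codeset(), and
the EX_ placeholder bits are hypothetical and exist only for this
illustration; nothing here is part of the real sys/cygwin.h.

```c
#include <assert.h>

/* Codeset selectors, with the values proposed above. */
enum {
    CCP_CYGWIN_CODESET = 0,
    CCP_WIN32_ANSI_CP  = 0x10,
    CCP_WIN32_OEM_CP   = 0x20,
    CCP_UTF8_CODESET   = 0x30
};

/* Hypothetical: mask covering the two codeset bits (not in the proposal). */
#define CCP_CODESET_MASK 0x30u

/* Placeholders standing in for the other bits of the flag word. */
#define EX_CONVTYPE 0x02u
#define EX_RELATIVE 0x100u

/* Hypothetical helper: extract the codeset selector from a flag word,
   ignoring all bits outside the codeset mask. */
unsigned ccp_codeset(unsigned what)
{
    unsigned cs = what & CCP_CODESET_MASK;
    assert(cs == CCP_CYGWIN_CODESET || cs == CCP_WIN32_ANSI_CP ||
           cs == CCP_WIN32_OEM_CP || cs == CCP_UTF8_CODESET);
    return cs;
}
```

Because the Cygwin codeset is 0, callers that pass no codeset flag would
keep the current behaviour under this scheme.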


Corinna
Corinna Vinschen
2011-11-18 09:56:41 UTC
Post by Corinna Vinschen
Post by Eric Blake
Post by Corinna Vinschen
So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.
Yes, that sounds right to me,
Post by Corinna Vinschen
I think this would be more correct than converting to the current Cygwin
multibyte charset. The downside is that this *might* break backward
compatibility. However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
I'm just worried that this would open a can of worms.
If CCP_POSIX_TO_WIN_A always converts to ANSI/OEM, shouldn't
CCP_WIN_A_TO_POSIX always convert from ANSI/OEM? However, if the DOS
path has been entered on the Cygwin command line, it will very likely
not be given in the current ANSI/OEM CP, but rather in the Cygwin
charset.
Having said that, I'm wondering if we shouldn't leave the current
conversion alone and rather add new flags to cygwin_conv_path, so that
the *caller* can specify whether the conversion should be done using the
Cygwin or the Windows multibyte charset, or always UTF-8. Something
along these lines:
CCP_CYGWIN_CODESET = 0, <-- Do you have a better idea?
CCP_WIN32_ANSI_CP = 0x10,
CCP_WIN32_OEM_CP = 0x20,
CCP_UTF8_CODESET = 0x30,
Does nobody have an opinion here?


Corinna
Andy Koppe
2011-11-18 10:57:33 UTC
Post by Corinna Vinschen
Post by Eric Blake
Post by Corinna Vinschen
So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.
Yes, that sounds right to me,
Post by Corinna Vinschen
I think this would be more correct than converting to the current Cygwin
multibyte charset.  The downside is that this *might* break backward
compatibility.  However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
I'm just worried that this would open a can of worms.
If CCP_POSIX_TO_WIN_A always converts to ANSI/OEM, shouldn't
CCP_WIN_A_TO_POSIX always convert from ANSI/OEM?
Yes, I think so.
Post by Corinna Vinschen
However, if the DOS
path has been entered on the Cygwin command line, it will very likely
not be given in the current ANSI/OEM CP, but rather in the Cygwin
charset.
A program that assumes something other than the Cygwin charset for
command line arguments is buggy.

Having said that, I assume the concern here is about pre-1.7 programs,
where assuming the ANSI/OEM codepage for command line arguments would
have been reasonable. However, such programs won't actually be using
CCP_POSIX_TO_WIN_A and CCP_WIN_A_TO_POSIX, since those were only
introduced with 1.7. Instead, they'll be using the deprecated
cygwin_conv_to_posix_path() and its relatives.

I understand those currently do the same as their cygwin_conv_path_t
equivalents, but that doesn't have to be that way. So how about if
those legacy functions keep current behaviour in an attempt to
maximise backward compatibility, whereas CCP_POSIX_TO_WIN_A and
CCP_WIN_A_TO_POSIX are changed to do what they say they do?
Post by Corinna Vinschen
Having said that, I'm wondering if we shouldn't leave the current
conversion alone and rather add new flags to cygwin_conv_path, so that
the *caller* can specify whether the conversion should be done using the
Cygwin or the Windows multibyte charset, or always UTF-8.  Something
along these lines:
 CCP_CYGWIN_CODESET = 0,       <-- Do you have a better idea?
 CCP_WIN32_ANSI_CP  = 0x10,
 CCP_WIN32_OEM_CP   = 0x20,
 CCP_UTF8_CODESET   = 0x30,
It's a possibility, but I find it a bit confusing and unnecessary.
Windows paths can already be converted to/from any required codeset by
going via the wide (i.e. WIN_W) version of the path and converting
with the appropriate choice of
MultiByteToWideChar/WideCharToMultiByte/mbstowcs/wcstombs.
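
The route described above could look roughly like this. It is a sketch
under stated assumptions, not code from Cygwin: wide_to_locale() and
posix_to_win_codepage() are hypothetical names. The first helper is
portable C and converts a wide string to the charset of the current C
locale with wcstombs. The second is compiled only on Cygwin; it fetches
the WIN_W path and converts it with WideCharToMultiByte, choosing the ANSI
or OEM codepage according to AreFileApisANSI.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated
   multibyte string in the current C locale's charset.
   Returns NULL on conversion or allocation failure; caller frees. */
char *wide_to_locale(const wchar_t *ws)
{
    assert(ws != NULL);
    size_t n = wcstombs(NULL, ws, 0);   /* bytes needed, without the NUL */
    if (n == (size_t)-1)
        return NULL;
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    if (wcstombs(out, ws, n + 1) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
}

#ifdef __CYGWIN__
#include <windows.h>
#include <sys/cygwin.h>

/* Hypothetical helper: POSIX path -> Windows path in the ANSI or OEM
   codepage, going via the wide path. Caller frees the result. */
char *posix_to_win_codepage(const char *posix)
{
    /* With size 0, cygwin_conv_path returns the required size in bytes. */
    ssize_t wsize = cygwin_conv_path(CCP_POSIX_TO_WIN_W | CCP_ABSOLUTE,
                                     posix, NULL, 0);
    if (wsize < 0)
        return NULL;
    wchar_t *wpath = malloc((size_t)wsize);
    if (wpath == NULL)
        return NULL;
    if (cygwin_conv_path(CCP_POSIX_TO_WIN_W | CCP_ABSOLUTE,
                         posix, wpath, (size_t)wsize) != 0) {
        free(wpath);
        return NULL;
    }
    UINT cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
    /* First call sizes the buffer (count includes the NUL). */
    int n = WideCharToMultiByte(cp, 0, wpath, -1, NULL, 0, NULL, NULL);
    char *out = n > 0 ? malloc((size_t)n) : NULL;
    if (out != NULL &&
        WideCharToMultiByte(cp, 0, wpath, -1, out, n, NULL, NULL) == 0) {
        free(out);
        out = NULL;
    }
    free(wpath);
    return out;
}
#endif
```

Characters that have no representation in the target codepage are
replaced by a default character in the second helper, so a round trip
back to the wide path is not guaranteed to be lossless.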

Andy
Corinna Vinschen
2011-11-18 11:16:15 UTC
Post by Andy Koppe
Post by Corinna Vinschen
Post by Eric Blake
Post by Corinna Vinschen
So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.
Yes, that sounds right to me,
Post by Corinna Vinschen
I think this would be more correct than converting to the current Cygwin
multibyte charset.  The downside is that this *might* break backward
compatibility.  However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
I'm just worried that this would open a can of worms.
If CCP_POSIX_TO_WIN_A always converts to ANSI/OEM, shouldn't
CCP_WIN_A_TO_POSIX always convert from ANSI/OEM?
Yes, I think so.
Post by Corinna Vinschen
However, if the DOS
path has been entered on the Cygwin command line, it will very likely
not be given in the current ANSI/OEM CP, but rather in the Cygwin
charset.
A program that assumes something other than the Cygwin charset for
command line arguments is buggy.
Having said that, I assume the concern here is about pre-1.7 programs,
where assuming the ANSI/OEM codepage for command line arguments would
have been reasonable. However, such programs won't actually be using
CCP_POSIX_TO_WIN_A and CCP_WIN_A_TO_POSIX, since those were only
introduced with 1.7. Instead, they'll be using the deprecated
cygwin_conv_to_posix_path() and its relatives.
I understand those currently do the same as their cygwin_conv_path_t
equivalents, but that doesn't have to be that way. So how about if
those legacy functions keep current behaviour in an attempt to
maximise backward compatibility,
But to maximize backward compatibility, they should use ANSI/OEM,
too, shouldn't they? It depends on the point from where you define
backward compatibility in this case.
Post by Andy Koppe
whereas CCP_POSIX_TO_WIN_A and
CCP_WIN_A_TO_POSIX are changed to do what they say they do?
Post by Corinna Vinschen
Having said that, I'm wondering if we shouldn't leave the current
conversion alone and rather add new flags to cygwin_conv_path, so that
the *caller* can specify whether the conversion should be done using the
Cygwin or the Windows multibyte charset, or always UTF-8.  Something
along these lines:
 CCP_CYGWIN_CODESET = 0,       <-- Do you have a better idea?
 CCP_WIN32_ANSI_CP  = 0x10,
 CCP_WIN32_OEM_CP   = 0x20,
 CCP_UTF8_CODESET   = 0x30,
It's a possibility, but I find it a bit confusing and unnecessary.
Windows paths can already be converted to/from any required codeset by
going via the wide (i.e. WIN_W) version of the path and converting
with the appropriate choice of
MultiByteToWideChar/WideCharToMultiByte/mbstowcs/wcstombs.
Along the lines of what Daniel wrote you're right, I guess. It's
a lot easier to implement only one choice, too.


Corinna
Andy Koppe
2011-11-18 12:48:10 UTC
Post by Corinna Vinschen
Post by Andy Koppe
Post by Corinna Vinschen
Post by Eric Blake
Post by Corinna Vinschen
So I was wondering if the CCP_POSIX_TO_WIN_A function shouldn't be
changed so that it converts the pathname to the current ANSI or OEM
charset instead, depending on the value returned by the AreFileApisANSI
function.
Yes, that sounds right to me,
Post by Corinna Vinschen
I think this would be more correct than converting to the current Cygwin
multibyte charset.  The downside is that this *might* break backward
compatibility.  However, if an application converts a Cygwin POSIX path
to a native Windows multibyte path, isn't it always for the sake of
calling a Win32 ANSI function or to submit the path to a native Windows
application?
Precisely for this reason - the only sane reason to convert to native is
to use the resulting string in native calls.
I'm just worried that this would open a can of worms.
If CCP_POSIX_TO_WIN_A always converts to ANSI/OEM, shouldn't
CCP_WIN_A_TO_POSIX always convert from ANSI/OEM?
Yes, I think so.
Post by Corinna Vinschen
However, if the DOS
path has been entered on the Cygwin command line, it will very likely
not be given in the current ANSI/OEM CP, but rather in the Cygwin
charset.
A program that assumes something other than the Cygwin charset for
command line arguments is buggy.
Having said that, I assume the concern here is about pre-1.7 programs,
where assuming the ANSI/OEM codepage for command line arguments would
have been reasonable. However, such programs won't actually be using
CCP_POSIX_TO_WIN_A and CCP_WIN_A_TO_POSIX, since those were only
introduced with 1.7. Instead, they'll be using the deprecated
cygwin_conv_to_posix_path() and its relatives.
I understand those currently do the same as their cygwin_conv_path_t
equivalents, but that doesn't have to be that way. So how about if
those legacy functions keep current behaviour in an attempt to
maximise backward compatibility,
But to maximize backward compatibility, they should use ANSI/OEM,
too, shouldn't they?
Hmm, indeed. When programs use the legacy conversion functions to
interface with Windows ANSI APIs they would want the ANSI/OEM codepage
to be used.

But that conflicts with the use case you cited where a program is
converting a Windows path it got via the Cygwin command line. That's
part of a wider issue though: on 1.5, they could have passed such a
path straight to a Windows ANSI API, whereas on 1.7 it needs to be
converted from the Cygwin charset first. There's nothing that can be
done about that one, short of changing the program (at which point
adapting to Unicode APIs would be the sensible thing to do).

So yeah, I'd go with ANSI/OEM for all the conversion functions
actually, and accept that Windows paths on the command line need to be
handled with more care on 1.7.

(That reminds me again of the mkshortcut overhaul that I never got
round to completing ...)

Andy
