[Svnmerge] Unicode in log messages

Fri Oct 9 12:00:03 PDT 2009

Please keep replies on list...

Benson Margulies wrote:
> The point is that it only uses the encoding to write the file. It reads
> the bytes from the log raw, and pushes them into the codec to write them
> into the file. Thus, it is assuming that the input is UTF-8, and asking
> for the output to be in the default locale. That's how the codecs work.
> It isn't using a codec to convert from input, only to convert the output.

I'm sorry Benson, but I believe you are operating under some
fundamental misconceptions... Of course it has to use a codec to
convert from input ("input" here is the svn log output).

Any time one reads bytes that one knows are characters (as output by
svn log), one needs to apply a codec to the bytes to understand what
those characters are. You contradict yourself by saying that it is
assuming the input is UTF-8 -- UTF-8 is just another codec, no
different from other codecs except in the actual byte value(s) used to
represent characters. Assuming UTF-8 would indeed mean using a codec
to decode the input.
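
To make that concrete, here is a minimal sketch (Python 3, my own
illustration, not code from svnmerge.py) showing that the same bytes
yield different characters depending on which codec is applied. The
variable names are made up for the example.

```python
# The same two bytes, interpreted with two different codecs.
raw = b"\xc3\xa9"

# Under UTF-8 these two bytes form one character, U+00E9.
as_utf8 = raw.decode("utf-8")

# Under Latin-1 each byte is its own character: U+00C3 then U+00A9.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))  # prints: 1 2
```

Neither result is "raw"; both are the product of a codec. Without one,
the bytes are just numbers.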

Here is what it is really doing:

def recode_stdout_to_file(s):
    [... if statement snipped ...]
    u = s.decode(sys.stdout.encoding)
    return u.encode(locale.getdefaultlocale()[1])

i.e. svnmerge.py is decoding the bytes of the svn log output using the
codec returned by sys.stdout.encoding. This may be UTF-8, but it may
be something else depending on your local platform and settings. There
is *no assumption* of UTF-8 here. Then it is encoding those characters
back into bytes (and eventually writing these bytes to a file), using
the codec returned by locale.getdefaultlocale()[1]. This encoding is
what svn expects in the content of files that it reads commit log
messages from via the -F parameter.
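
The same decode-then-encode shape can be sketched in Python 3 with the
two encodings passed in explicitly. This is only an illustration under
assumed names: recode(), input_encoding and output_encoding are
hypothetical and do not appear in svnmerge.py, which takes its
encodings from sys.stdout.encoding and locale.getdefaultlocale()[1].

```python
def recode(raw, input_encoding, output_encoding):
    """Decode bytes with one codec, then re-encode them with another."""
    text = raw.decode(input_encoding)    # bytes -> characters
    return text.encode(output_encoding)  # characters -> bytes


# Hypothetical log message containing one non-ASCII character (U+00E9),
# as it would look if the log were emitted in UTF-8.
log_bytes = "Fixed caf\u00e9 bug".encode("utf-8")

# Re-encode for a tool that expects Latin-1 in the message file.
file_bytes = recode(log_bytes, "utf-8", "latin-1")

print(file_bytes)  # prints: b'Fixed caf\xe9 bug'
```

Both steps use a codec. The only question is whether the two codecs
chosen match what svn actually writes and reads.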

The possible error here is that one of our two assumptions is wrong:
the encoding svn uses when printing a log to stdout (assumed to be
sys.stdout.encoding), or the encoding svn uses when reading a commit
log file to create a commit message (assumed to be
locale.getdefaultlocale()[1]). If either assumption is wrong, then
yes, there is a problem that needs to be fixed. But it has nothing to
do with "assuming" UTF-8.
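
For illustration, this is roughly what a wrong input assumption would
look like (a Python 3 sketch with made-up data, not a reproduction of
any reported failure). Depending on the codec wrongly chosen, the
decode either succeeds with the wrong characters or fails outright.

```python
# Bytes that are really UTF-8: "caf" followed by U+00E9.
raw = "caf\u00e9".encode("utf-8")

# Wrong codec that accepts any byte: no error, but the wrong characters.
garbled = raw.decode("latin-1")

# Wrong codec that rejects bytes >= 0x80: the decode raises instead.
try:
    raw.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(len(garbled), decode_failed)  # prints: 5 True
```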

> And this makes sense. It's completely wrong to assume that the svn log
> messages are in the current user's default locale encoding. It
> makes some sense that users would want to edit a file in their current
> encoding, it just doesn't always work.

Huh? Do you have some evidence that svn, when writing a commit log to
standard output, does not write the data in the encoding specified by
Python's sys.stdout.encoding value? If so, great -- please provide
such evidence and a patch with your fix.
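
As a starting point for gathering such evidence, something like the
following reports what Python believes the two encodings are on a
given machine. It is a Python 3 sketch, and report_encodings() is a
hypothetical helper, not part of svnmerge.py. Note that I use
locale.getpreferredencoding(False) here rather than
locale.getdefaultlocale()[1], so the second value may not match what
svnmerge.py computes on every platform.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python reports for stdout and the locale."""
    return {
        # May be None when stdout is not a regular text stream.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Stand-in for locale.getdefaultlocale()[1]; see the note above.
        "locale": locale.getpreferredencoding(False),
    }


encodings = report_encodings()
for name in sorted(encodings):
    print(name, encodings[name])
```

Comparing these values with the actual bytes svn emits for a log
message containing a known non-ASCII character would show whether the
assumption holds.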

Cheers,
Raman