Ticket #11 (closed defect: fixed)
Unicode support
| Reported by: | edemaine@… | Owned by: | xi |
|---|---|---|---|
| Priority: | normal | Component: | pyyaml |
| Severity: | normal | Keywords: | |
| Cc: | | | |
Description
I would like to bring up two issues with Unicode support in PyYAML's emitter. First, it emits a type annotation of !!python/unicode whenever emitting a unicode string that can be encoded in ASCII:
>>> print yaml.dump(u'Fran\xe7ais')
"Fran\xE7ais"
>>> print yaml.dump(u'hello')
!!python/unicode 'hello'
I assume this is to force the value to be a unicode string when read back in. However, it makes for rather ugly files. In my case, and I imagine many others, I really don't care whether a string is stored as a 'str' or as a 'unicode' object in Python. And in YAML, the native string type is Unicode anyway. So it seems strange to have this distinction at the level of the YAML file. On the other hand, I understand the desire to have yaml.load(yaml.dump(x)) == x. Perhaps this should be another configuration option? (Of course, I could just convert my ASCII-encodable unicode objects to str objects...)
The second issue is that the emitter escapes non-ASCII characters even when all characters are printable (according to 'c-printable' in the YAML spec) and the output encoding (UTF-8) supports them. I don't find this as elegant as it could be. Instead of the "Fran\xE7ais" output above, I would have hoped for the UTF-8-encoded byte string Fran\xc3\xa7ais\n.
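The byte string hoped for here is simply the UTF-8 encoding of the text. A minimal sketch in Python 3 (the ticket itself uses Python 2; the variable names are illustrative only) that contrasts it with the escaped form:

```python
# Python 3 sketch; the names below are illustrative and not part of PyYAML.
text = 'Fran\xe7ais'  # U+00E7 falls inside the YAML spec's c-printable range

# What the reporter hoped to see: the text encoded as UTF-8 bytes.
utf8_bytes = (text + '\n').encode('utf-8')

# What the emitter produced by default: a double-quoted scalar with an escape.
escaped = '"Fran\\xE7ais"'

print(utf8_bytes)  # b'Fran\xc3\xa7ais\n'
print(escaped)     # "Fran\xE7ais"
```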
I guess this is as stylistic an issue as the previous one. It makes me wonder again whether there should be a Style object that can specify various emitting options, instead of many keyword arguments...
Change History
comment:2 Changed 7 years ago by edemaine@…
Wow, that was a fast response. I didn't realize that's what the Safe line of dumpers did; thanks. And I obviously didn't know about the allow_unicode option; it's exactly what I wanted. Thanks so much!
The only thing more I could hope for is documentation of all these features (other than reading through the code). Is this in progress? Can I help?
comment:3 Changed 7 years ago by xi
Well, I'm writing the docs now, check PyYAMLDocumentation. But it's just a rough draft.
As I'm not a native speaker, writing English prose is a PITA for me and the result is mediocre, so any help will be greatly appreciated. If you find a mistake or an unclear expression, feel free to fix it. Well, I would be glad if someone wrote the docs for me, but it's not going to happen. :)
Anyway, you don't need to check it now since I'm modifying it. But if you are willing to review it later, I would really appreciate it.
comment:4 Changed 7 years ago by edemaine@…
I have been reading that documentation, and it seems well written. (But I'm also happy to review it--send me email when you would like me to.) It just doesn't yet describe all of the features (particularly all the options), which I can understand :-). Some documentation about the design of the system would be helpful too, in particular, which classes do what, so it's clearer how to extend/modify.
P.S. allow_unicode is working great.
comment:5 Changed 7 years ago by edemaine@…
- Status changed from closed to reopened
- Resolution fixed deleted
I found a bug with allow_unicode = True:
>>> yaml.load(yaml.dump(u'\udd00'))
u'\udd00'
>>> yaml.load(yaml.dump(u'\udd00',allow_unicode=True))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/__init__.py", line 59, in load
    loader = Loader(stream)
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/reader.py", line 114, in __init__
    self.determine_encoding()
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/reader.py", line 167, in determine_encoding
    self.update(1)
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/reader.py", line 201, in update
    self.check_printable(data)
  File "/toc/home/edemaine/Packages/lib/python2.5/site-packages/yaml/reader.py", line 176, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #xdd00: special characters are not allowed
  in "<string>", position 0
I believe the offending lines are 962-964 of emitter.py (Emitter.write_double_quoted):
if ch is None or ch in u'"\\\x85\u2028\u2029\uFEFF' \
        or not (u'\x20' <= ch <= u'\x7E'
            or (self.allow_unicode and ch > u'\x7F')):
Compare this with line 169 of reader.py (Reader.NON_PRINTABLE):
NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
The latter is consistent with 'c-printable' in the YAML spec (except that it doesn't include #x10000-#x10FFFF--no support for 32-bit?). The former only seems to support 8-bit unicode properly...
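The mismatch described above can be checked in isolation, without PyYAML. The sketch below is Python 3 (the ticket's code is Python 2) and copies the two conditions as they are quoted in this ticket; the helper names emitter_would_escape and reader_rejects are made up for illustration and are not PyYAML functions.

```python
import re

# Reader side: pattern as quoted in the ticket (Reader.NON_PRINTABLE).
NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')


def emitter_would_escape(ch, allow_unicode):
    """Hypothetical mirror of the condition quoted from write_double_quoted.

    The `ch is None` branch is omitted because a real character is always
    passed in here.
    """
    return ch in '"\\\x85\u2028\u2029\uFEFF' or not (
        '\x20' <= ch <= '\x7E' or (allow_unicode and ch > '\x7F'))


def reader_rejects(ch):
    """Hypothetical helper: True if the reader-side pattern flags ch."""
    return NON_PRINTABLE.search(ch) is not None


lone_surrogate = '\udd00'
print(emitter_would_escape(lone_surrogate, allow_unicode=False))  # True
print(emitter_would_escape(lone_surrogate, allow_unicode=True))   # False
print(reader_rejects(lone_surrogate))                             # True
print(reader_rejects('\xe7'))                                     # False
```

With allow_unicode=True the emitter-side test lets the lone surrogate U+DD00 through unescaped, while the reader-side pattern rejects it, which matches the traceback above. The same pattern also rejects code points above U+FFFD, such as U+10000.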

comment:1 by xi
You are right about me wanting type(yaml.load(yaml.dump(x))) to be equal to type(x). Still, it can easily be overridden. The easiest way is to use safe_dump:
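A minimal sketch of such a safe_dump call, assuming PyYAML is installed. It is written to run on Python 3 as well, where str is already Unicode and the !!python/unicode tag no longer arises, so there it only shows that the output carries no Python-specific tag:

```python
import yaml  # assumes PyYAML is installed

# safe_dump restricts the output to standard YAML tags.
out = yaml.safe_dump(u'hello')
print(out)

# The scalar is written plainly, with no !!python/... annotation.
print('!!python' in out)  # False
```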
safe_dump is "safe" because it produces only standard YAML tags, no !!python/something tags are emitted. If you still want to use dump, you may change the unicode representer:
You might need to change the str representer too, but the corresponding code will be longer. Check SafeRepresenter.represent_str.
The second issue is already addressed, try:
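A minimal sketch of what trying the allow_unicode option might look like, assuming Python 3 and an installed PyYAML (the variable names are illustrative):

```python
import yaml  # assumes PyYAML is installed; Python 3 shown

# Default: non-ASCII characters are written as escapes in a quoted scalar.
default_out = yaml.dump('Fran\xe7ais')

# With allow_unicode=True the character is emitted as-is.
unicode_out = yaml.dump('Fran\xe7ais', allow_unicode=True)

print(default_out)
print(unicode_out)
```

On Python 3 yaml.dump returns text unless an encoding is given, so the second result contains the character itself rather than UTF-8 bytes.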
The default is to escape non-ASCII characters because they will produce garbage in non-utf8 terminals.
The latter issue is stylistic, but the former is definitely not a stylistic issue. Different tags imply that the corresponding scalar nodes are different while the scalar style does not affect equality of nodes. You may be right about some kind of a Style object, but I need more use cases before introducing it.
I'm closing the ticket, but feel free to reopen it if you feel your issues are not completely solved.