Ticket #12 (closed enhancement: fixed)

Opened 8 years ago

Last modified 8 years ago

PyYAML is slow

Reported by: edemaine@…
Owned by: xi
Priority: normal
Component: pyyaml
Severity: normal
Keywords:
Cc:

Description

Here are two simple wall-clock timings comparing PyYAML to PySyck on a Pentium 4 2.8GHz with 1MB cache and 1GB RAM:

$ wc file1.yaml
 2036  8767 59154 file1
$ test.py file1.yaml
0:00:00.001419 to read the YAML via Syck
0:00:04.029627 to read the YAML via PyYAML
$ wc file2.yaml
  8949  35105 317342 file2
$ test.py file2.yaml
0:00:00.001564 to read the YAML via Syck
0:00:19.288912 to read the YAML via PyYAML

I do not expect PyYAML to be terribly competitive with Syck: the language barrier is big, and PyYAML is written with a higher level of abstraction. But I was surprised to see a factor of 12,000 difference. I wonder if a bit of profiling and tuning might reduce this gap to just a couple of orders of magnitude (100x) instead of four? Personally, 19 seconds to read a 0.3 meg file is too slow for my application, so I'll have to switch back to Syck for now, unfortunately. Just food for thought...
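For reference, the slowdown factors implied by the timings above can be recomputed with a short sketch. The helper names `to_seconds` and `slowdown` are illustrative and not part of test.py; the input strings are copied from the output above and are assumed to be in the `H:MM:SS.ffffff` form that Python prints for a `timedelta`.

```python
def to_seconds(stamp):
    """Convert an 'H:MM:SS.ffffff' string into seconds (hypothetical helper)."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def slowdown(fast, slow):
    """Return how many times longer the slow run took than the fast run."""
    return to_seconds(slow) / to_seconds(fast)


# Timings copied from the two runs reported above.
file1_factor = slowdown("0:00:00.001419", "0:00:04.029627")
file2_factor = slowdown("0:00:00.001564", "0:00:19.288912")

print(f"file1.yaml: PyYAML took about {file1_factor:,.0f}x as long as Syck")
print(f"file2.yaml: PyYAML took about {file2_factor:,.0f}x as long as Syck")
```

The figure of roughly 12,000 corresponds to the larger file; the smaller file gives a factor of a little under 3,000.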

Attachments

test.py Download (340 bytes) - added by edemaine@… 8 years ago.
A simple Syck vs. PyYAML driver
CSAIL.yaml Download (246.5 KB) - added by edemaine@… 8 years ago.
A large YAML file (slightly culled to fit on Trac)
test.2.py Download (772 bytes) - added by edemaine@… 8 years ago.
New performance test script
test.3.py Download (779 bytes) - added by edemaine@… 8 years ago.
Corrected test script

Change History

comment:1 Changed 8 years ago by xi

  • Status changed from new to assigned

A gap is expected for C vs. Python, but I'm also surprised by the size of the difference. I usually get about a 200x difference on simple tests. Please attach your files and the script so I can check them.

You may try psyco; it might give you about a 1.5-5.0x speedup:

>>> from yaml.reader import Reader
>>> from yaml.scanner import Scanner
>>> from yaml.parser import Parser
>>> from yaml.composer import Composer
>>> from yaml.constructor import Constructor
>>> from psyco import bind
>>> bind(Reader)
>>> bind(Scanner)
>>> bind(Parser)
>>> bind(Composer)
>>> bind(Constructor)

The real solution is, of course, to rewrite the code in C. That's planned, but don't expect it too soon.

comment:2 Changed 8 years ago by edemaine@…

OK, here is a sample file on the larger side (8961 lines, 301,229 bytes), and a simple driver script generating output similar to the last example above.

Changed 8 years ago by edemaine@…

A simple Syck vs. PyYAML driver

Changed 8 years ago by edemaine@…

A large YAML file (slightly culled to fit on Trac)

comment:3 Changed 8 years ago by xi

Sorry for the trac spam :(. I'll try to deal with it somehow.

On the bright side, I've started the LibYAML project, which will eventually allow this bug to be closed. :)

comment:4 Changed 8 years ago by xi

  • Status changed from assigned to closed
  • Resolution set to fixed

The libyaml bindings are now usable (though not as fast as possible).

comment:5 Changed 8 years ago by edemaine@…

I finally got to try the LibYAML bindings of PyYAML. In case you're curious, here is a repeat of the simple test from before. The improvement so far is about a factor of 10 (without Psyco), but there are still 3 more orders of magnitude to go to get down to Syck speed.

$ python test.py CSAIL.ycard
0:00:00.001437 to read the YAML via Syck
0:00:13.661756 to read the YAML via PyYAML
0:00:01.181506 to read the YAML via PyYAML/LibYAML

Changed 8 years ago by edemaine@…

New performance test script

comment:6 Changed 8 years ago by xi

There is a problem in your test code in the line:

  cards = syck.load_documents (open (sys.argv[1]))

The function load_documents is a generator, so it does not really load the documents. You should replace it with

  for card in syck.load_documents (open (sys.argv[1])):
      pass
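The effect can be reproduced without Syck installed. The following is a minimal sketch using a hypothetical stand-in, `lazy_load_documents`, which is not the PySyck function; it only assumes that the real `load_documents` behaves like an ordinary Python generator, as stated above. Calling the generator function does no work, and the parsing only happens once the result is iterated.

```python
parsed_lines = []  # records which inputs have actually been processed


def lazy_load_documents(lines):
    """Hypothetical stand-in for a generator-based loader."""
    for line in lines:
        parsed_lines.append(line)      # the "parsing" work happens here
        yield line.split(": ", 1)


source = ["name: Alice", "name: Bob", "name: Carol"]

# Calling the function only creates a generator object; nothing runs yet,
# so timing this call alone measures almost nothing.
documents = lazy_load_documents(source)
work_before_iteration = len(parsed_lines)

# Iterating drives the generator, and only now is the input processed.
for document in documents:
    pass
work_after_iteration = len(parsed_lines)

print(work_before_iteration, work_after_iteration)
```

A benchmark that times only the call therefore reports the cost of creating the generator, not the cost of loading the file.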

Please post the updated benchmarks :) PyYAML/LibYAML is 2-3 times slower than PySyck, probably because of Pyrex and PyYAML code overhead. I'm going to reduce overhead by replacing all Pyrex and some Python code with pure C.

You may also run

  yaml.CLoader (open (sys.argv[1])).raw_parse()

to check pure LibYAML performance.

comment:7 Changed 8 years ago by edemaine@…

Whoops, you are right! Sorry about that. Now they are within a factor of 2 as you state (I am actually using PySyck):

$ python test.py CSAIL.ycard
0:00:00.643884 to read the YAML via Syck
0:00:13.676710 to read the YAML via PyYAML
0:00:01.201301 to read the YAML via PyYAML/LibYAML

Nice work! Looking forward to even more optimizations.

Changed 8 years ago by edemaine@…

Corrected test script
