Improving Performance of DiffJ/JRuby

DiffJ is nearly ready for release, but I’ve not been content with the performance, which is significantly slower with the JRuby implementation than the pure Java version.

My changes were based on the recommendations on the JRuby wiki.

Before any optimization, a test run of diffj against a pair of Java files ran with the times:

user    :   4.84
system  :   0.19
cpu     : 184.20
total   :   2.72

Following the suggested changes, I added the -client argument to the Java process, which resulted in:

user    :   4.99
system  :   0.21
cpu     : 184.80
total   :   2.80

So performance actually worsened.

Next was passing the argument -Djruby.compile.mode=OFF arguments to the Java process:

user    :   4.92
system  :   0.17
cpu     : 184.80
total   :   2.75

Again, performance worsened slightly.

With both the -client and -Djruby.compile.mode=OFF arguments, performance was still down, as one would expect now:

user    :   4.92
system  :   0.18
cpu     : 186.00
total   :   2.73

So then I more carefully went through my code, looking at the time to require each file, and found two salient problems.

The first is that I was dynamically creating several hundred methods in the RIEL ANSIColor class, for each combination of decorations and foreground and background colors (such as “bold_red_on_white”). I refined that code, the RIEL Log and Loggable classes, and the extensions to the Ruby String class to dynamically create the color methods as necessary.

Thus the dynamic definition of the method “bold” is truly a usage of the decorator pattern.

That resulted in a slight improvement:

user    :   4.68
system  :   0.22
cpu     : 186.80
total   :   2.62

I wondered about the overhead of Rubygems, so I removed RIEL as a gem, and instead added it within the DiffJ source tree. Performance improved significantly:

user    :   3.74
system  :   0.20
cpu     : 184.80
total   :   2.13

Combining all of the above resulted in the best performance:

user    :   3.62
system  :   0.19
cpu     : 181.20
total   :   2.10

That is acceptable to me, and I’ll be releasing version 1.3.0 of DiffJ soon. Of course, it’s on Github here, so feel free to download it and build it.

Formatting Code (Properly == Consistently)

The first rule of formatting is that code should be consistent. That means that the code should be consistent within itself — that is, within its files — and across the code base.

And consistent with the language itself, or more accurately, the framework and libraries of the language.

Programmers should not try to make a statement with the format of their code. I’ve seen hideous, oddly-formatted code where it seems that the programmers wanted simply to look different, since their alternative style made no improvement in the legibility of the code. Sure, it’s different. But the problem is: it’s different. There is cognitive friction in that I’m reading code that doesn’t look like the other code in that language. It doesn’t look (exactly) like a JDK class. Or a Ruby library. Or part of the C++ STL. It’s different.

One of my primary tenets of programming (and a lot of other things) is: change only one variable at a time.

Nonstandard formatting violates that principle because it doesn’t take the predominant formatting style of a language as a constant, and instead treats it is a variable. So as a reader of that code I must grapple with two variables: the code itself, which is new (thus a variable) to me, and its nonstandard format, also new/different to me, not matching my expectations of how code in the language is supposed to look.

Yes, I know I can just reformat the code. And we can read files with a chain of input stream objects instead of, oh, something like IO.readlines. In the words of Alfred North Whitehead: “Civilization advances by extending the number of important operations which we can perform without thinking about them.”

Said inversely, this means that our experience worsens with the number of variables we have to think about, with which I’m sure any programmer would agree. And that includes code formatting.

Adding setup and teardown to RubyUnit Test Suite

The tests for pvn raise an interesting question to me. The pvn subcommands that wrap/extend the svn subcommands process the svn output, so the tests need svn output to run.

I’ve considered doing mock objects for svn, but that became confounding, where I was essentially mocking all of the svn subcommands that I need. So I’m leaning in the direction of doing operations against a “real” Subversion repository. On a 234M Subversion repository, it takes around 0.4 seconds to do a backup (cp -r), when both the source and the target locations are on the SSD in my machine.

So it’s feasible for the test sequence to be:

  • back up the svn repository (0.4 seconds)
  • check out the svn repository checked out to working copy (4 seconds)
  • working copy modified, with added, changed and deleted files and directories (< 10 seconds for all tests)
  • svn repository restored to backed up version (1 second)

I’m still considering how to go about this, but in the meantime I’ve written the following for adding suite-wide setup and teardown for the pvn base testcase. Note that this is on a class:

require 'runit/testcase'

module PVN
  class TestCase < RUNIT::TestCase
    include Loggable

    WIQUERY_DIRNAME = "/Programs/wiquery/trunk"

    class << self
      def setup
        @@orig_location = Pathname.pwd
      end

      def teardown
        Dir.chdir @@orig_location
      end

      def suite
        @@cls = self

        ste = super

        def ste.run(*args)
          @@cls.setup
          super
          @@cls.teardown
        end
        ste
      end
    end

sort{ |a, b| a <=> b }.ing with Ruby

One idiom from Perl that I’ve missed with Ruby is the ability to chain comparisons together, such as:

my @a = qw{ this is a test };

$, = ", ";

print sort { substr($a, -1) cmp substr($b, -1) || length($a) <=> length($b) } @a;
print "\n";

Which results in the output:

a, is, this, test

In Ruby, it’s a little more complicated, since Perl evaluates a zero as false, but Ruby does not. However, the nonzero? method for all Ruby Numeric objects essentially performs this conversion, for use in a boolean evaluation, returning nil if it is zero, and the number otherwise. So in Ruby, the above code would be:

a = %w{ this is a test }

puts a.sort { |a, b| (a[-1] <=> b[-1]).nonzero? || a.length <=> b.length }.join(", ")

One additional note: if you’re using this in a spaceship method (“”) for the Comparable interface, remember that it must return a numeric value, so if you chain evaluations together, the final statement should be zero, since all previous evaluations were nil (meaning that they were equal). This bit me during some recent DiffJ work, and here is an example of a corrected method:

class Java::net.sourceforge.pmd.ast::Token
  include Comparable, Loggable

  # ...

  def <=> other
    (kind <=> other.kind).nonzero? ||
      (image <=> other.image).nonzero? ||
      0
  end

That’s DiffJ opening the PMD token Java class and adding the Ruby Comparable interface to it, so tokens can be sorted in Ruby collections.

On that note, DiffJ is in rough beta status now. I’m using it for my work (refactoring and cleaning up legacy Java code), and just corrected a glitch in the Token code, ironically enough, for supporting usage in Hash objects. I’d neglected to implement the eql? method, erroneously thinking that Hash uses the Comparable code.

With that fix, the JRuby implementation of DiffJ produces the same output as the Java implementation. It’s somewhat slower, so I’ve been investigating AOT compiling of it, but that doesn’t seem to have much of an effect.

I just realized that another feature from Perl that I’ve missed (and until writing that code above, hadn’t used for 10 years) is defining the array separator with the “$,” variable. Similar to that, my RIEL library modifies the to_s method of an Array to output “, ” between elements for output, since the default is to have no space between elements.