A Day Refactoring Code, Part 4: Performance

This post is part of the series about refactoring code, and this post follows part 3.

With the functionality set, let’s focus on performance, since this class is heavily used, being at the core of DoctorJ. The main function in that regard is checkCurrentWord(), which we can immediately see uses a StringBuffer instead of the more performant StringBuilder.

But before we start doing any performance tuning, let’s get a rough estimate of how the current performance is. Running just TestParsingSpellChecker takes 14.735 seconds, and all 300+ tests take 27.098 seconds.

Here’s the outline of the method, in terms of how a word is extracted for checking:

    protected void checkCurrentWord() {
        StringBuffer word = new StringBuffer();
        word.append(currentChar());
        
        int startingPosition = this.pstr.getPosition();
        boolean canCheck = true;

        this.pstr.advancePosition();

        while (hasMore()) {
            char ch = currentChar();
            ... check by character ...
        }

        if (canCheck && this.pstr.getPosition() - startingPosition > 1) {
            checkWord(word.toString(), startingPosition);
        }
    }

Replacing StringBuffer with StringBuilder reduces the time to 11.457 seconds for the single test, although the time for all tests doesn’t change.

The most obvious problem with this code is that we are violating DRY with how the substring is extracted. The starting position is redundant with the word being built, so we can just remove the word buffer, and store the substring by its starting position, extracting it from pstr at the end of processing. (Note that java.lang.String is optimized for substring, with how it uses a char[] and offset).

We’ll need a new method in PString to get a substring at a position other than the current one, and that can be quickly coded and tested:

    /**
     * Returns a substring from the given position, of <code>num</code>
     * characters.
     */
    public String substring(int position, int num) {
        return this.str.substring(position, this.position + num);
    }

See the problem there? We have clashing variables, this.position and position. I’ve changed back and forth from _var to this.var, and am undecided.

Back to TestPSC, the new method gets significantly cleaned up. We have two integers, startingPosition and nChars, for when and if the substring needs to be extracted. the test now runs in 8.821 seconds.

Okay, those numbers are misleading, since that is just the Gradle execution time, which includes startup, compilation, etc., and from looking at the unit tests, there are no performance tests — the unit tests are mostly simple tests consisting of one or just a few lines of input data for the spell checker.

Let’s write a unit test, and get some data for it. We’ll use our own dictionary as input, getting a subset of its lines, and scrambling a few to be misspelled words:

    ~doctorj% perl -lane 'next if rand() < 0.95; $_ = reverse($_) if rand() < 0.2; print' etc/words.en_US > src/test/resources/wordlist.txt

Note that that goes into src/test/resources, which gets added to the build subdirectory by Gradle, and thus is in the classpath, to be read from our test code (with a couple of IJDK methods):

    public void testPerformance() {
        InputStream in = getClass().getClassLoader().getResourceAsStream("wordlist.txt");
        List<String> lines = InputStreamExt.readLines(in);

        // wrap as a Javadoc comment block:
        lines.add(0, "/** \n");
        lines.add("*/\n");

        String comment = StringExt.join(lines, "\n");
    }

First, let’s look at what the performance was with the old version of PSC, from 5 revisions back:

~doctorj% git checkout master~5 src/main/java/org/incava/text/spell/ParsingSpellChecker.java

With our test:

    public void testPerformance() {
        InputStream in = getClass().getClassLoader().getResourceAsStream("wordlist.txt");
        List<String> lines = InputStreamExt.readLines(in);

        // wrap as a Javadoc comment block:
        lines.add(0, "/** \n");
        lines.add("*/\n");

        String comment = StringExt.join(lines, "\n");

        long start = System.currentTimeMillis();
        
        tcsc.check(comment);

        long end = System.currentTimeMillis();

        long duration = end - start;
        System.err.println("# words: " + lines.size() + "; duration: " + duration);
    }

Alas, there’s nearly no difference:

previous code: # words: 7377; duration: 71273
new code: # words: 7377; duration: 69576

But it’s a slight improvement, and the code is much simpler. We also have a new class, PString, for usage in other scenarios in which a string is quasi-parsed.

Finally, we’ll disable our performance tests from the unit tests, check everything in, and look for the next thing to fix.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s