Stati of Projects

I’ve been jumping around from one project to another on an as-needed basis, and here is the current status of each.

RIEL – I’ve been updating the colorizing code, adding the functionality to use extended color codes on ANSI terminals, instead of the default 10.

Glark – In the midst of a major rewrite, for both code purity and functionality. The main changes so far are extensions to the path/file arguments, bringing in some functionality from find. Glark will also (with the imminent integration of RIEL) support extended colors.

PVN – This project was in heavy development until a month ago, when it went into a state of waiting for updates to Glark, since it will be using Glark for its seek (searching) subcommand.

DiffJ – This project I rewrote in JRuby, but that was a little too slow for a command-line application –  even heavily tweaked, I couldn’t get the startup time under two seconds. So I re-rewrote it in Java, and intend to revisit it to add more intelligent code comparisons, such as understanding that “if (a == b) foo();” is the same as “if (a == b) { foo(); }”

IJDK – Mostly dormant at this point. When I’ve rewritten Java projects, I’ve tried to extract the generic code from them and add them to IJDK, but I haven’t been heavily involved with any of my Java projects lately.

Java-Diff – A while ago this was brought up to date to use generics, and was refactored for code clarity. There hasn’t been any reason to update it since then.

DoctorJ – Alas, this is dormant. Some of its warnings have been integrated into the Java compiler, such as mismatched parameter names. It still goes beyond that level of pedanticalness, so it is most suitable in the development of projects where documentation is paramount, such as APIs.

Related posts:

Advertisements

A Day Refactoring Code, Part 4: Performance

This post is part of the series about refactoring code, and this post follows part 3.

With the functionality set, let’s focus on performance, since this class is heavily used, being at the core of DoctorJ. The main function in that regard is checkCurrentWord(), which we can immediately see uses a StringBuffer instead of the more performant StringBuilder.

But before we start doing any performance tuning, let’s get a rough estimate of how the current performance is. Running just TestParsingSpellChecker takes 14.735 seconds, and all 300+ tests take 27.098 seconds.

Here’s the outline of the method, in terms of how a word is extracted for checking:

    protected void checkCurrentWord() {
        StringBuffer word = new StringBuffer();
        word.append(currentChar());
        
        int startingPosition = this.pstr.getPosition();
        boolean canCheck = true;

        this.pstr.advancePosition();

        while (hasMore()) {
            char ch = currentChar();
            ... check by character ...
        }

        if (canCheck && this.pstr.getPosition() - startingPosition > 1) {
            checkWord(word.toString(), startingPosition);
        }
    }

Replacing StringBuffer with StringBuilder reduces the time to 11.457 seconds for the single test, although the time for all tests doesn’t change.

The most obvious problem with this code is that we are violating DRY with how the substring is extracted. The starting position is redundant with the word being built, so we can just remove the word buffer, and store the substring by its starting position, extracting it from pstr at the end of processing. (Note that java.lang.String is optimized for substring, with how it uses a char[] and offset).

We’ll need a new method in PString to get a substring at a position other than the current one, and that can be quickly coded and tested:

    /**
     * Returns a substring from the given position, of <code>num</code>
     * characters.
     */
    public String substring(int position, int num) {
        return this.str.substring(position, this.position + num);
    }

See the problem there? We have clashing variables, this.position and position. I’ve changed back and forth from _var to this.var, and am undecided.

Back to TestPSC, the new method gets significantly cleaned up. We have two integers, startingPosition and nChars, for when and if the substring needs to be extracted. the test now runs in 8.821 seconds.

Okay, those numbers are misleading, since that is just the Gradle execution time, which includes startup, compilation, etc., and from looking at the unit tests, there are no performance tests — the unit tests are mostly simple tests consisting of one or just a few lines of input data for the spell checker.

Let’s write a unit test, and get some data for it. We’ll use our own dictionary as input, getting a subset of its lines, and scrambling a few to be misspelled words:

    ~doctorj% perl -lane 'next if rand() < 0.95; $_ = reverse($_) if rand() < 0.2; print' etc/words.en_US > src/test/resources/wordlist.txt

Note that that goes into src/test/resources, which gets added to the build subdirectory by Gradle, and thus is in the classpath, to be read from our test code (with a couple of IJDK methods):

    public void testPerformance() {
        InputStream in = getClass().getClassLoader().getResourceAsStream("wordlist.txt");
        List<String> lines = InputStreamExt.readLines(in);

        // wrap as a Javadoc comment block:
        lines.add(0, "/** \n");
        lines.add("*/\n");

        String comment = StringExt.join(lines, "\n");
    }

First, let’s look at what the performance was with the old version of PSC, from 5 revisions back:

~doctorj% git checkout master~5 src/main/java/org/incava/text/spell/ParsingSpellChecker.java

With our test:

    public void testPerformance() {
        InputStream in = getClass().getClassLoader().getResourceAsStream("wordlist.txt");
        List<String> lines = InputStreamExt.readLines(in);

        // wrap as a Javadoc comment block:
        lines.add(0, "/** \n");
        lines.add("*/\n");

        String comment = StringExt.join(lines, "\n");

        long start = System.currentTimeMillis();
        
        tcsc.check(comment);

        long end = System.currentTimeMillis();

        long duration = end - start;
        System.err.println("# words: " + lines.size() + "; duration: " + duration);
    }

Alas, there’s nearly no difference:

previous code: # words: 7377; duration: 71273
new code: # words: 7377; duration: 69576

But it’s a slight improvement, and the code is much simpler. We also have a new class, PString, for usage in other scenarios in which a string is quasi-parsed.

Finally, we’ll disable our performance tests from the unit tests, check everything in, and look for the next thing to fix.

A Day Refactoring Code, Part 3: Naming

“Naming is the essence of programming.” -Fred Brooks, The Mythical Man-Month

Continuing from part 2 of this series, let’s now revisit our names, now that we have a better understanding of the behavior here.

PosString hasn’t grown on me as a name, and taking a bit of inspiration from the GString in Groovy, let’s rename it PString, which suggests “Processable String” and “Positionable String”. We’ll also improve the field names “pos” and “len” by renaming them as “position” and “length”. “str” is also quite short, but is a common name for a string, whereas “pos” and “len” are less common.

Going back to ParsingSpellChecker (which I’ll refer to as PSC), PString now has all of its low-level string processing functionality. There are methods in ParsingSpellChecker that seem more specific to its domain, such as skipLink().

The edge condition here would be a method such as skipThroughWord(), which advances to the next character that is not a letter or digit. As the descriptions of the methods get longer, they are more likely to be too specific for usage in a general class.

So that’s it for PString. Let’s start migrating ParsingSpellChecker to use it. First is that we’ll keep the instance fields as public, and just switch the
references:

            this.pstr = new PString(str);
            this.str = pstr.str;
            this.len = pstr.length;

Through the code we replace “this.pos” with “this.pstr.position”, so again the low-level behavior is the same, just with ParsingSpellChecker being highly coupled to PString, for now.

We’ll do likewise with the len/length fields, so now our PSC methods have code such as:

    protected void consumeTo(String what) {
        while (this.pstr.position < this.pstr.length && this.pstr.position + what.length() < this.pstr.length && !this.str.substring(this.pstr.position).startsWith(what)) {
            ++this.pstr.position;
        }
    }

We’re midstream on this task, but given our status (show by running the alias for “git diff –stat HEAD”) we see that we have 52 changed lines of code, which is a reasonable amount of code for a checkin:

~doctorj% gitdfs
 .../org/incava/text/spell/ParsingSpellChecker.java |   52 +++++++++-----------
 1 files changed, 23 insertions(+), 29 deletions(-)

We check it in and get back to work:

~doctorj% git commit -m "Began transition for ParsingSpellChecker to use PString." .
[master 95db8c1] Began transition for ParsingSpellChecker to use PString.
 1 files changed, 23 insertions(+), 29 deletions(-)

We remove this.str references, replacing them with “this.pstr.str”, then began “flipping” the methods in PSC to call their PString equivalents, such as this change (diff) for currentChar():

     protected Character currentChar() {
-        return this.str.charAt(this.pstr.position);
+        return this.pstr.currentChar();
     }

We make the transition a bit easier by renaming “consume()” to “consumeFrom()”, matching our PString method advanceFrom().

We go through the PSC code, replacing usage of PString fields with accessors and the equivalent mutators, such as in skipBlanks():

    // before

    protected void skipBlanks() {
        while (this.pstr.position + 2 < this.pstr.length && currentChar() != '<' && !this.str.substring(this.pstr.position, this.pstr.position + 2).equals("{@") && !Character.isLetterOrDigit(currentChar())) {
            ++this.pstr.position;
        }
    }

    // after

    protected void skipBlanks() {
        while (this.pstr.hasChar() && !this.pstr.isMatch("<") && !this.pstr.isMatch("{@") && !Character.isLetterOrDigit(currentChar())) {
            this.pstr.advancePosition();
        }
    }

That’s still the worst offender in terms of “width”, the number of characters wide a line is, which is a good way to find candidate code to be rewritten “skinnier”. (Other ways to find candidate code: by number of lines, and by depth of nested blocks, a variant of width).

In the next part of this series we’ll wrap it up with a look at performance and code quality.

A Day Refactoring Code, Part 2: Unit Testing

Continuing from part 1, we’ll begin writing test code from low-level functionality to higher-level, essentially paralleling the class as it grows in functionality as well.

It might seem like overkill to test very low-level code (such as getX()), but without low-level tests, a programmer may find himself analyzing the code at an improper level, mistaking for complex errors problems that are more fundamental.

Not testing low-level functionality led to a problem when I was prototyping a Javadoc parser (for DoctorJ). I wrote the original prototype (redundancy alert) in Ruby, then began migrating it to JRuby (intending to link it with Java code).

When I switched from Ruby to JRuby, there were failed unit tests, and I assumed that the problem was in my code, not in JRuby. (Yes, I adhere to the principle that “select ain’t broken”.)

Eventually, after much work, I discovered that a bug in the Regexp.last_match functionality in JRuby meant that the very high-level parsing functionality unit tests failed, and it took a lot of effort to drill down through the parsing code to discover that the issue was with the Regexp engine itself. I was thankful for the opportunity to contribute a fix for the issue, which was released with JRuby 1.6.5.

If I had had simpler tests, I probably would have found the issue sooner.

Back to unit testing in general, a good rule is to write tests in the order of the number of instance variables they test, either directly or indirectly. Thus the simplest tests would invoke methods that use only one instance variable (such as getX and setX). Subsequent tests should accrete in complexity, testing more instance variables.

This is essentially taking the concept from integration tests (multiple components) within unit testing itself, seeing variables (and sequences of methods and usage of multiple classes) as components as well. Thus unit testing begins with the smallest *unit* (variables), and expands to the point of being in the gray area where integration tests begin.

Back to this example, the next method to test is the one returning the character at the current position. Again we’ll write a custom assertion, then the test itself.

Implementing the tests such that it advances the position beyond the end of the string results in an exception:

        // out of range:
        str0.advancePosition(12);
        assertCurrentChar(null, str0);
java.lang.StringIndexOutOfBoundsException: String index out of range: 15
	at java.lang.String.charAt(String.java:694)
	at org.incava.text.PosString.currentChar(PosString.java:75)
	at org.incava.text.TestPosString.assertCurrentChar(TestPosString.java:36)
	at org.incava.text.TestPosString.testCurrentChar(TestPosString.java:84)

That shows that currentChar() should be updated to test that it is within bounds:

from:

        return this.str.charAt(this.pos);

to:

        return this.pos < this.len ? this.str.charAt(this.pos) : null;

The next test is for the hasMore method. What is important here is that we are testing the edge condition, for when pos is 1 less than len, is equal to len, and is 1 greater than it.

        str0.advancePosition();
        assertHasMore(true, str0);

        str0.advancePosition(12);
        assertHasMore(true, str0);

        str0.advancePosition();
        assertHasMore(false, str0);

        str0.advancePosition();
        assertHasMore(false, str0);

This seems to work … except that we didn't actually test the edge condition by hand. Why not? Laziness (false), where we just assumed that the string is 13 characters long.

Simplicity

This highlights another tenet of good unit testing: simplicity. The string we tested is so long ("this is a test" is my default string for string/regexp testing) that we didn't do the test "manually".

It's been a while that we've been rewriting the code, so we can consider checking this into our VCS, or we can just keep going.

When I'm on a multi-person project, I (nearly) never check in code that will fail the unit tests. In contrast, in my local environment, I find it easier to restart work after a break (especially when starting work in the morning) to have the project in the state of a failed unit test. This immediately gets me to the latest point of progress, and focuses my attention on the specific problem I was last working on. Buffer lists just aren't enough, since they only show me the latest files that I was working on, but not exactly what.

So we check in the code, then get back to the hasMore method. We made a mistake by, again, starting with overly complicated tests, in this case a string that was too long, so we'll go simple here, adding the field, and amending the method:

    private PosString abc;

    public void testHasMore() {
        assertHasMore(true, abc); // at 'a'
        // no change
        assertHasMore(true, abc);

        abc.advancePosition();  // at 'b'
        assertHasMore(true, abc);

        str0.advancePosition(); // at 'c'
        assertHasMore(true, abc);

        str0.advancePosition(); // off end
        assertHasMore(false, str0);
    }

Did you see the error? That was the first version that I wrote, which led to this failure:

junit.framework.AssertionFailedError: 'this is a test' (2) expected:<false> but was:<true>
	at org.incava.text.TestPosString.assertHasMore(TestPosString.java:44)
	at org.incava.text.TestPosString.testHasMore(TestPosString.java:123)    

‘this is a test’? This error was caused by my not doing a replace-all within the method, instead editing it “by hand”. (I’m convinced that copying and pasting, combined with search and replace, are responsible for 77.1% of software problems.)

We’ll fix our method:

    public void testHasMore() {
        assertHasMore(true, abc); // at 'a'
        // no change
        assertHasMore(true, abc);

        abc.advancePosition();  // at 'b'
        assertHasMore(true, abc);

        abc.advancePosition(); // at 'c'
        assertHasMore(true, abc);

        abc.advancePosition(); // off end
        assertHasMore(false, abc);
    }

Okay, that works, but should it? Is there “more” when the position is at the last character? I say no, so we change the hasMore code to call the method with num == 1:

    public boolean hasNumMore(int num) {
        return this.pos + num < this.len;
    }

We update the test:

        abc.advancePosition(); // at 'c'
        assertHasMore(false, abc);

Now the unit tests run without failure. Note that parameters to methods, and variables in general, have a length in characters proportionate to their scope. It's overkill to write "indexOfTheArray" in a small for-loop, whereas having instance variables named "td" are inadequately descriptive.

Writing the hasNumMore tests gives us a chance to use the Range class from the IJDK library, making for a bit terser and cleaner syntax:

    public void testHasNumMore() {
        for (int i : new Range(0, 10)) {
            assertHasNumMore(true, tenletters, i);
        }
    }

Here we're using another test string, "tenletters", which happens to be ten letters long. Fun, huh?

This is an excellent example of why custom assertions are valuable. Within a loop (checking a list of values), without a message for the assertEquals(), we would not know the value for which the assertion failed. But in this case, with the custom assertion, we get the output showing the value (10) that failed:

junit.framework.AssertionFailedError: pstr: ''tenletters' (0)'; num: 10 expected:<true> but was:<false>
	at org.incava.text.TestPosString.assertHasNumMore(TestPosString.java:51)
	at org.incava.text.TestPosString.testHasNumMore(TestPosString.java:131)

That actually shows that the test is incorrect: there are only 9 more characters from the initial position in the string. So we change the end of the range from 10 to 9 (that is, the range is 0 through 9), rerun, and the tests pass.

From the old class, we will pull the consume and consumeTo, amending their names and their code, starting with advanceTo, which is derived from the old consumeTo:

    public void advanceTo(String what) {
        while (this.pos < this.len && this.pos + what.length() < this.len && !this.str.substring(this.pos).startsWith(what)) {
            ++this.pos;
        }
    }

    public void testAdvanceTo() {
        assertCurrentChar('a', abc);
        abc.advanceTo("b");
        assertCurrentChar('b', abc);
        abc.advanceTo("c");
        assertCurrentChar('c', abc);
        abc.advanceTo("d");
        assertCurrentChar(null, abc);
    }

That raises another question: what happens when we advance to the end of the string? Should the current character be the last one, or should it be null?

We'll amend it to be null, and clean up the code, which has too long of a conditional (and redundant, with the "pos < len && pos + n < len" statements,
adding a new method, isMatch:

        while (this.pos < this.len && !isMatch(what)) {
            this.pos++;
        }

    public boolean isMatch(String what) {
        return this.pos + what.length() <= this.len && this.str.substring(this.pos).startsWith(what);
    }

Bringing over the old consumeFrom method gives us a chance to use the new isMatch method, for clearer code:

    protected boolean consume(String what) {
        skipBlanks();
        if (this.pos + what.length() < this.len && this.str.substring(this.pos).startsWith(what)) {
            this.pos += what.length();
            return true;
        }
        else {
            return false;
        }
    }

    public boolean advanceFrom(String what) {
        if (isMatch(what)) {
            advancePosition(what.length());
            return true;
        }
        else {
            return false;
        }
    }

This is in keeping with the principle of having fewer dots per line. more dots indicates a violation of the Law Of Demeter, with excessive low-level access to instance variables.

Notice that this method, as we've advanced through the class, is using no instance variables directly. Essentially the class is "coming up" from a low-level implementation closer to the point of how it will be used.

We'll bring over consumeFromTo, renaming it advanceFromTo. Again, we have no instance variables directly accessed:

    public void advanceFromTo(String from, String to) {
        if (advanceFrom(from)) {
            advanceTo(to);
        }
    }

We got all the way to here, and from looking at the code, we have an edge condition issue, as to what hasMore means. Does it mean "has more characters remaining to the end of the string", or does it mean "has a current character", that is, is not at the end?

Following the principle of when in doubt, copy, we'll imitate the behavior of the Java collections, specifically Iterable, for which hasNext() means "has a current element", and next() takes it.

So hasMore() will be amended to mean "has a current character", and will be renamed to hasChar(), which is less ambiguous:

    public boolean hasChar() {
        return this.pos < this.len;
    }

Notice that we slipped isMatch into the code, without adding tests for it, so let's fix that now, moving the tests before those for the advanceFrom/To methods, since isMatch is lower level, that is, is used by advanceFrom and advanceTo.

The implementation of assertIsMatch quotes the string. Against this can be useful, in cases where the string would be a whitespace character.

        String msg = "pstr: " + pstr + "; str: '" + str + "'";
        assertEquals(msg, exp, pstr.isMatch(str));

Repetition in Tests

We write the advance* methods, and end up with this one for testAdvanceFromTo():

    public void testAdvanceFromTo() {
        assertCurrentChar('a', abc);
        abc.advanceFromTo("a", "c");
        assertCurrentChar('c', abc);

        assertCurrentChar('t', tenletters);

        // no change
        tenletters.advanceFromTo("e", "n");
        assertCurrentChar('t', tenletters);

        // to first e
        tenletters.advanceFromTo("t", "e");
        assertCurrentChar('e', tenletters);
        assertPosition(1, tenletters);

        // to second e
        tenletters.advanceFromTo("e", "e");
        assertCurrentChar('e', tenletters);
        assertPosition(4, tenletters);

        // to second t
        tenletters.advanceFromTo("e", "t");
        assertCurrentChar('t', tenletters);
        assertPosition(5, tenletters);

        // to third t (next character)
        tenletters.advanceFromTo("t", "t");
        assertCurrentChar('t', tenletters);
        assertPosition(6, tenletters);
    }

By using the higher-level (custom) assertions the code is much cleaner, but there is obvious repetition, so we will extract the common code into a method that executes a subset of behavior and assertions.

    protected void doAdvanceFromToTest(Character expChar, int expPos, PosString pstr, String from, String to) {
        pstr.advanceFromTo(from, to);
        assertCurrentChar(expChar, pstr);
        assertPosition(expPos, pstr);
    }

Our testAdvanceFromTo method is now much cleaner:

    public void testAdvanceFromTo() {
        assertCurrentChar('a', abc);
        doAdvanceFromToTest('c', 2, abc, "a", "c");

        assertCurrentChar('t', tenletters);

        // no change
        doAdvanceFromToTest('t', 0, tenletters, "e", "n");

        // to first e
        doAdvanceFromToTest('e', 1, tenletters, "t", "e");

        // to second e
        doAdvanceFromToTest('e', 4, tenletters, "e", "e");

        // to second t
        doAdvanceFromToTest('t', 5, tenletters, "e", "t");

        // to third t (next character)
        doAdvanceFromToTest('t', 6, tenletters, "t", "t");

        // to end of string
        doAdvanceFromToTest('s', 9, tenletters, "t", "s");

        // off end of string
        doAdvanceFromToTest(null, 10, tenletters, "s", "j");
    }

A simple rule of economics is as cost decreases, demand (usage/consumption) increases, and that rule applied to testing is that as tests are easier (and more fun, frankly) to write, people will write more tests.

Note in the example above, with the low-level functionality and testing put into a separate method, the test code can focus on the edge conditions, such as “what happens if I’m on the same string as the one I want to advance to?” and “what happens when one advances to the end of the string?”

Edge conditions are where code will fail. Beware of thinking that more assertions are better than fewer; quality of assertions is better than sheer mass. (And “Sheer Mass” would be a good name for a rock band.)

In the next post, we’ll discuss naming, “the essence of programming”.

A Day Refactoring Code, Part 1: Rewriting

This post continues the overview of code being refactored.

We’ll pull the str, pos, len fields and build a new class around them. Naturally, the first step is writing a unit test, beginning by testing the lowest-level functionality, which set, get, and advance. That low-level functionality is only an intermediate step in getting the class we want, since eventually we’ll have some more expansive behavior.

So for now those methods are public, but we’ll probably eliminate them or reduce their accessibility, thus encouraging usage of higher-level methods instead.

For each method we’ll be testing, we will have a custom assertion, an assertion that essentially wraps the low-level assertEquals with a more expressive message associated with it.

Custom assertions should generally follow the form:

    public void assertMethodName(VType expectedValue, OType objTested, Object ... methodArguments) {
        String msg = "obj: " + objTested + "; arguments: " + methodArguments;
        assertEquals(msg, exp, objTested.methodName(methodArguments));
    }

for example, here is the custom assertion for testing substring, which takes an
integer argument:

    public void assertSubstring(String exp, PString pstr, int pos, int num) {
        String msg = "pstr: " + pstr + "; pos: " + pos + "; num: " + num;
        assertEquals(msg, exp, pstr.substring(pos, num));
    }

This produces the output:

junit.framework.ComparisonFailure: pstr: 'abc' (0); pos: 1; num: 1 expected:<b> but was:<b>
	at junit.framework.Assert.assertEquals(Assert.java:85)
	at org.incava.text.TestPString.assertSubstring(TestPString.java:44)
	at org.incava.text.TestPString.testSubstringWithPosition(TestPString.java:110)

In comparison, unit tests that use the low-level assertEquals such as the follow result in the subsequent unit test output:

    public void assertSubstring(String exp, PString pstr, int pos, int num) {
        assertEquals(exp, pstr.substring(pos, num));
    }
junit.framework.ComparisonFailure: expected:<b> but was:<b>
	at junit.framework.Assert.assertEquals(Assert.java:85)
	at junit.framework.Assert.assertEquals(Assert.java:91)
	at org.incava.text.TestPString.assertSubstring(TestPString.java:48)
	at org.incava.text.TestPString.testSubstringWithPosition(TestPString.java:114)

In the first example, we know the position (1), number (1), and information about PString itself (its wrapped string is “abc”, and it’s at position 0).

In the next post, we’ll delve more deeply into the unit tests.

A Day Refactoring Code, Part 0: Overview

I’ve been rewriting much of DoctorJ, which has been suffering some code rot after several years of existence, now at major release 5.

While poking through the code, I found the class ParsingSpellChecker, which as its name implies, spell-checks, in particular working on Javadoc comment blocks, such as skipping <pre> blocks and "{@link" tags:

package org.incava.text.spell;

import java.io.InputStream;
import java.util.*;
import org.incava.ijdk.util.MultiMap;

public class ParsingSpellChecker {
    /**
     * Whether we can spell check, which we can't until we get a word or a
     * dictionary (which is a thing with a lot of words).
     */
    private boolean canCheck;

    /**
     * The spell checker, which doesn't parse.
     */
    private final SpellChecker checker;

    /**
     * The current string we're working on.
     */
    private String str;

    /**
     * The length of the current string.
     */
    private int len;

    /**
     * The current position within the string.
     */
    private int pos;
    
    public ParsingSpellChecker(SpellChecker checker) {
        this(checker, false);
    }

    public ParsingSpellChecker(SpellChecker checker, boolean canCheck) {
        this.checker = checker;
        this.canCheck = canCheck;
    }

    public boolean addDictionary(String dictionary) {
        this.canCheck = this.checker.addDictionary(dictionary) || this.canCheck;
        return this.canCheck;
    }

    public boolean addDictionary(InputStream dictionary) {
        this.canCheck = this.checker.addDictionary(dictionary) || this.canCheck;
        return this.canCheck;
    }

    public void addWord(String word) {
        this.checker.addWord(word);
        this.canCheck = true;
    }

    public void check(String str) {
        if (this.canCheck) {
            this.str = str;
            this.len = this.str.length();
            this.pos = 0;
    
            while (hasMore()) {
                skipToWord();
                if (hasMore()) {
                    if (Character.isLetter(currentChar())) {
                        checkCurrentWord();
                    }
                    else {
                        // not at an alpha character. Might be some screwy formatting or
                        // a nonstandard tag.
                        skipThroughWord();
                    }
                }
            }
        }
    }

    /**
     * Consumes from one string to another.
     */
    protected void consumeFromTo(String from, String to) {
        if (consume(from)) {
            consumeTo(to);
        }
    }

    /**
     * Skips content between pairs of brackets, a la HTML, such as, oh,
     * "<html>foobar</html>." Doesn't handle slash-gt, such as "<html
     * style="lackthereof" />".
     */
    protected void skipBracketedSection(String section) {
        consumeFromTo("<" + section + ">", "</" + section + ">");
    }

    protected void skipLink() {
        consumeFromTo("{@link", "}");
    }
    
    protected void skipBlanks() {
        while (this.pos + 2 < this.len && currentChar() != '<' && !this.str.substring(this.pos, this.pos + 2).equals("{@") && !Character.isLetterOrDigit(currentChar())) {
            ++this.pos;
        }
    }
    
    protected void skipToWord() {
        skipBracketedSection("code");
        skipBracketedSection("pre");
        skipLink();
        consume("&;nbsp;");
    }

    protected void checkWord(String word, int position) {
        MultiMap<Integer, String> nearMatches = new MultiMap<Integer, String>();
        boolean valid = this.checker.checkCorrectness(word, nearMatches);
        if (!valid) {
            wordMisspelled(word, position, nearMatches);
        }
    }

    protected void wordMisspelled(String word, int position, MultiMap<Integer, String> nearMatches) {
        int nPrinted = 0;
        final int printGoal = 15;
        for (int i = 0; nPrinted < printGoal && i < 4; ++i) { // 4 == max edit distance
            Collection<String> matches = nearMatches.get(i);
            if (matches != null) {
                for (String match : matches) {
                    System.out.println("    near match '" + match + "': " + i);
                    ++nPrinted;
                }
            }
        }
    }

    protected boolean consume(String what) {
        skipBlanks();
        if (this.pos + what.length() < this.len && this.str.substring(this.pos).startsWith(what)) {
            this.pos += what.length();
            return true;
        }
        else {
            return false;
        }
    }

    protected void consumeTo(String what) {
        int len = this.str.length();
        while (this.pos < len && this.pos + what.length() < len && !this.str.substring(this.pos).startsWith(what)) {
            ++this.pos;
        }
    }

    protected Character currentChar() {
        return this.str.charAt(this.pos);
    }

    protected boolean hasMore() {
        return this.pos < this.len;
    }

    protected void checkCurrentWord() {
        StringBuffer word = new StringBuffer();
        word.append(currentChar());
        
        int startingPosition = this.pos;
        boolean canCheck = true;

        ++this.pos;

        // spell check words that do not have:
        //     - mixed case (varName)
        //     - embedded punctuation ("wouldn't", "pkg.foo")
        //     - numbers (M16, BR549)
        while (hasMore()) {
            char ch = currentChar();
            if (Character.isWhitespace(ch)) {
                break;
            }
            else if (Character.isLowerCase(ch)) {
                word.append(ch);
                ++this.pos;
            }
            else if (Character.isUpperCase(ch)) {
                skipThroughWord();
                return;
            }
            else if (Character.isDigit(ch)) {
                skipThroughWord();
                return;
            }
            else {
                // must be punctuation, which we can check it if there's nothing
                // but punctuation up to the next space or end of string.
                if (this.pos + 1 == this.len) {
                    // that's OK to check
                    break;
                }
                else {
                    ++this.pos;
                    while (hasMore() && !Character.isWhitespace(currentChar()) && !Character.isLetterOrDigit(currentChar())) {
                        // skipping through punctuation
                        ++this.pos;
                    }
                    if (this.pos == this.len || Character.isWhitespace(currentChar())) {
                        // punctuation ended the word, so we can check this
                        break;
                    }
                    else {
                        // punctuation did NOT end the word, so we can NOT check this
                        skipThroughWord();
                        return;
                    }
                }
            }
        }

        // has to be more than one character:
        if (canCheck && this.pos - startingPosition > 1) {
            checkWord(word.toString(), startingPosition);
        }
    }

    protected void skipThroughWord() {
        ++this.pos;
        while (hasMore() && !Character.isWhitespace(currentChar())) {
            ++this.pos;
        }
    }
}

The problem with ParsingSpellChecker is that the low-level functionality (going through a string by “position” (current index))and checking against string length) don’t match the more high-level behavior of the class.

Another red flag in the class is there are three instance fields (len, pos, and str) that are very tightly connected, with essentially each of them having to be used with one another.

It is also telling that the indentation levels within the class are very deep, also indicating poor abstraction. The fewer indentation levels, the more likely it is that the code is evenly abstracted at the proper level.

Code repetition is another issue, with statements such as this.pos++ found in many locations.

From running PMD, we know that StringBuilder is preferred over StringBuffer for performance reasons. Similarly, by building its own substrings (via StringBuffer.append()), the code does not take advantage of String.substring, which uses offsets in the newly-created String to refer to a substring within the lower-level char array.

That’s the overview. In the next post, we’ll begin rewriting this mess.

DoctorJ 5.2.0 and DiffJ 1.2.0 Released

DoctorJ 5.2.0 is an extensive rewrite of the code, with unit tests expanded and refined. It uses the latest version of PMD as the parsing and AST code, and is the first version of DoctorJ to use the IJDK module, leading to much more elegant code. In terms of functionality this version add spell-checking for strings of a minimal length (which defaults to 2). The project now builds with Gradle.

DiffJ 1.2.0 also has significantly rewritten code, and also uses the IJDK module. This version adds (on terminals that support it) colorization of differences. It too uses Gradle for its build.

Both projects are now distributed only in zip format, and I welcome offers to repackage them for various package managers.

The distributes are available for download: