Tests Result in Better Programs, and Programmers

It’s widely noted and accepted that tests result in better code. My conjecture is that tests result in vastly improved programmers.

Even with a good test suite, the first implementation of a project will be utilitarian, complying with the edict of “just get it working!”. These are the red and green phases of the red/green/refactor cycle: red is when the tests fail, and green is when they pass. Too often, that initial version is the end of the cycle for many programmers and projects, which is why, in general, version 1 of a program is horrible:

[image: the Windows 1.01 logo]

In the refactoring phase the project goes from simply “working” to being well-crafted. In the process, the programmer’s mindset changes as well, from writing “working” code to writing well-crafted code, and the programmer is elevated to a higher level of craftsmanship.

Refactoring is when a programmer can develop their skills and push the boundaries of their knowledge, such as by delving into advanced object-oriented programming and metaprogramming, which are often considered dauntingly complex and risky. Used properly, though, those techniques can dramatically improve a project, and, I believe more importantly, they can dramatically improve the programmer.

In my experience, this was first proven to me when I was working on a large C++ project, my module being our persistence layer, providing a J2EE-like (but vastly simpler) interface to PostgreSQL. Although in that era (the late 1990s) Test-Driven Development hadn’t become popular, my daily goal was to write more test code than “real” code, typically between 500 and 1500 lines per day, usually (but not always) writing my tests first.

As my test suite grew and I became more confident in its ability to catch errors, I pushed my knowledge of C++, particularly with templates and the Standard Template Library, eventually reaching the point where I really understood the magic of the STL code. Had I not had such tests, and thus confidence, I likely would have refrained from pushing myself into what had previously been a dark, murky area of C++ for me, and I would not have gained valuable expertise as a C++ programmer.

In fact, I’d say that it’s only because of tests that I’ve felt confident learning a new language, such as when I rewrote DiffJ in JRuby, a language I hadn’t used. Having a thorough test suite made the learning process much easier.

The bottom line is that programmers must understand that testing includes refactoring (including of the test code itself), and that the refactoring phase is where programmers, and projects, can become significantly better.

Oh, and with thorough refactoring that horrible version 1 can eventually lead to significant improvements:

[image: macOS]


Tests Result in Better Coders

We know that tests result in better code. They also result in better programmers.

The first implementation of a project will often be utilitarian, complying with the edict of “just get it working”. Code will be written and tweaked until the tests pass, but there are few cycles, if any, devoted to refining the code to higher quality. Thus the red/green/refactor cycle is limited to just red and green.

The refactoring phase is where the code goes from simply working to being well-crafted. As the mindset of the programmer changes from a focus on utility to a focus on quality, their programming changes accordingly.

With the principle of quality foremost, and with the confidence that comes from a thorough set of tests, the programmer can also push the boundaries of their knowledge, such as by delving into advanced object-oriented programming and metaprogramming, which can be dauntingly complex and risky. Those areas do not necessarily have an immediate impact on functionality, so when a programmer is in functionality/utility mode, they’re less likely to employ those techniques. In quality mode, however, a programmer will feel more justified, and confident, in using those techniques, which over the long term can dramatically improve a project.

Often forgotten is that both code and tests should be refactored. Test code that is unclear, misleading, or just wrong can be very frustrating to someone trying to understand a body of code, since the tests are the best starting point.

The bottom line is that programmers must understand that testing includes refactoring (including of the test code itself), and that the refactoring phase is where programmers, and projects, can become vastly better.


Rewriting Glark

For the PVN project, I’ve wanted to use Glark as a library for matching text (for the ‘seek’ subcommand), but I’d written Glark as a command-line application, and its design reflected that. It also, like so many “field-tested” programs, had very few tests, the expectation being that because it is so heavily used, flaws would easily surface. That’s relatively accurate: I’m fairly sure that I use Glark more than any other program not built into Unix (I even have grep aliased to “glark -g”).

I was once asked in a job interview what my opinion was of my own code. My response was that my code evidently sucks, because I’m always rewriting it. That was definitely true with Glark, being one of my first Ruby programs, migrating it from Perl, and not having touched it in a long time.

The problems with Glark were classics of bad programming: lack of tests, overly complicated code, use of global variables, poor class composition, and excessive coupling.

So I’ve been eager to rewrite Glark, but without tests a program is much too brittle, so I knew that I’d first have to add tests. I simply can’t enjoy writing code without tests; with a comprehensive test suite, however, rewriting code is bliss. So I first wrote a few tests, then refined my test framework to the point that writing a unit test is as simple as this:

def test_simple
  fname = '/proj/org/incava/glark/test/resources/textfile.txt'
  expected = [
              "    3   -rw-r--r--   1 jpace jpace   45450 2010-12-04 15:24 02-TheMillersTale.txt",
              "   10   -rw-r--r--   1 jpace jpace   64791 2010-12-04 15:24 09-TheClerksTale.txt",
              "   20   -rw-r--r--   1 jpace jpace   49747 2010-12-04 15:24 19-TheMonksTale.txt",
              "   24   -rw-r--r--   1 jpace jpace   21141 2010-12-04 15:24 23-TheManciplesTale.txt",
             ]
  run_app_test expected, [ '--xor', '\b6\d{4}\b', 'TheM.*Tale' ], fname
end

That’s Glark matching 6nnnn ^ TheM*Tale. At this point, grep has added some of what made Glark quite distinct from it — highlighted/colorized matches and context — but Glark’s most unusual (and fun to program) feature is matching of compound expressions, such as:

% glark --and=3 write --or puts print **/*.rb

That is matching “write” within 3 lines of puts or print.

Do you need that very often? Nope. But it does come in handy, such as in the case of “I’m looking for where we are catching an InvalidArgumentException and logging it (within the next 5 lines) as an error”:

% glark --and=5 'catch.*InvalidArgumentException' 'Log.error' **/*.java

Speaking of interviews, a friend of mine has a good practice: when he goes to a company to assess their software, he asks to see what they consider their worst code. Often that code is at the core of their project and is the oldest code, written early on by someone who may have left the company, and/or the code has been piled on with so much complexity that it is difficult to disentangle.

In Glark, the worst code of the entire application was the Options class, clocking in at 761 lines long, containing the 42 options in Glark. This class is a Singleton, which is the fancy Design Pattern way of saying Global Variable. (The worst by-product of the Gang of Four was the sanctifying of Singletons as being a good practice.)

Another sign that the Options class was poorly written is an insanely simple metric: wc. That is, running the “wc” command on all files, sorting the output numerically, and looking at the largest files. There it is at the bottom, in all its corpulent glory:

% wc **/*.rb | sort -rn
    4     7   160 lib/glark.rb
  102   654  5213 lib/glark/help.rb
  183   527  4569 lib/glark/input.rb
  248   640  6064 lib/glark/exprfactory.rb
  266   681  6052 lib/glark/output.rb
  297   777  7392 lib/glark/glark.rb
  440  1048  9663 lib/glark/expression.rb
  761  2615 23377 lib/glark/options.rb
 2301  6949 62490 total

The Options class is used throughout Glark, so extracting it was quite challenging. I decomposed the Options class into smaller groups, and it just so happened that there was a design to follow, documented in, of all places, the help (man) page for Glark. That is, because there are so many options, for legibility and organization they are displayed in the man page in the groups “input”, “matching”, “output”, and “debugging/errors”. So I repackaged the Glark options into input, match, output, and info, and also used that as the organization for the modules within Glark. Thus the Glark::File class went into lib/glark/input/file.rb, and Grep::Lines went into lib/glark/output/grep_lines.rb.

I continued to refine the tests to the point that adding new ones was trivial. As the test coverage increased, working on untested code came to feel like an aberration.

On that note, here’s an easy way to test your tests: to determine whether a test really works, break the code that it is testing. (A “return nil if true” at the beginning of a method works nicely.) If the test still passes, then it’s incomplete. If the test fails, that should add confidence that the test coverage is adequate. This is also where it becomes fun to break code, and to break tests. As with anything, if it’s fun, we’ll do more of it, which is why it is essential to have a test framework that makes tests easy (ergo, fun) to write.

For years I’ve tracked my daily progress by the simple metric of lines of code, but after hearing this suggested on the Ruby Rogues podcast, I’ve begun the practice of adding one user-facing feature per day, such as a new subcommand or option to PVN. My definition of “feature” includes adding and refining documentation, and it also includes removing options, especially if they are confusing, redundant, unused or obsolete.

I track features with a Features.txt in the root directory of each project, of the form (from PVN):

Thu Oct 25 19:22:52 2012

  seek command: added [ -C --no-color ] option.

This keeps me on track by actually recording features that are added. A script I run on all {project}/Features.txt files shows whether I’ve added one for today, and when I’ve missed a day (only one since I started doing this).

My process for adding features feels like an extension of the TDD process, in short:

  • conceive of a feature
  • add it as a test
  • implement it
  • run tests
  • refactor the tests and code …
  • document the feature in the readme and help files
  • add the feature to the features file
  • commit, with the feature description as the commit comment

That’s about it for this update on the ongoing rewrite of Glark. You can track the progress of it on GitHub (https://github.com/jeugenepace/glark), and if you want to see the code before the rewrite, check out revision 4d10f192f46ec3df34f971f8b40e03f8df0aed27.

Object Calisthenics: Stretch and Twist …

Quite a while ago I first read Object Calisthenics, by Jeff Bay, in the ThoughtWorks Anthology, and it has resulted in a pivotal change in how I write software. A testimony to its value to me is that I’ve reread it several times, putting it into what I would consider my desert-island set of books.

A PDF of it is available online.

Initially some of the suggestions struck me as odd, but overall I agreed with them. With the 1.3.0 rewrite of DiffJ, I decided to make more of an effort to put all of the principles into practice, and analyze and review the experience.

The difference was amazing.

The idea of Object Calisthenics is that the high-level concepts and principles of good object-oriented programming can be distilled into low-level rules to follow.

The rules are:

  • One level of indentation per method
  • Don’t use the “else” keyword
  • Wrap all primitives and strings
  • First class collections
  • One dot per line
  • Don’t abbreviate
  • Keep all entities small
  • No classes with more than two instance variables
  • No getters/setters/properties

Some of those rules sound familiar (keeping entities small, avoiding getX/setX), but others seem bizarre: don’t use else? wrapping primitives? no more than two instance variables?

But they work, and make sense.

1. One level of indentation per method

This rule is simple and, when applied, greatly reduces complexity. I avoid long methods — never more than a vertical screenful — and “wide” methods should be equally avoided. A good metric is to consider the complexity of a method equal to the product of its length (lines of code) and its depth (indentation). (Maybe one of these days I’ll revive statj, for Java statistical summaries.)

This rule also reduces complexity because it pushes code out of the depths of the problematic method to the objects being acted on. Thus a refactoring process might go from the original code of:

    public void moveEntities(List<Entity> entities) {
        for (Entity entity : entities) {
            if (isOnScreen(entity)) {
                for (Entity other : entities) {
                    if (entity != other && entity.nextLocation().equals(other.location())) {
                        collide(entity, other);
                    }
                    else {
                        moveEntity(entity);
                    }
                }
            }
        }
    }

to:

    public void moveEntities(List<Entity> entities) {
        for (Entity entity : entities) {
            processEntityLocation(entity, entities);
        }
    }

    public void processEntityLocation(Entity entity, List<Entity> entities) {
        if (!isOnScreen(entity)) {
            return;
        }

        Entity other = getCollisionEntity(entity, entities);
        if (other == null) {
            moveEntity(entity);
        }
        else {
            collide(entity, other);
        }
    }

    public Entity getCollisionEntity(Entity entity, List<Entity> entities) {
        for (Entity other : entities) {
            if (entity != other && entity.nextLocation().equals(other.location())) {
                return other;
            }
        }
        return null;
    }

Note the two levels of indentation in the getCollisionEntity method. I haven’t found a way to do the for-if-return loop with only one level.

The primary motivation of this rule is that one is not dealing with multiple levels — both indentation and abstraction — at once.

The bottom line: it works. I’ve written much more straightforward code this way, and the result is small methods that can then be pushed out of the current class to the primary object being acted upon, which is generally the first parameter of the method.

For example, we could move getCollisionEntity to the Entity class:

public class Entity {
    public Entity getCollisionEntity(List<Entity> entities) {
        for (Entity other : entities) {
        if (this != other && nextLocation().equals(other.location())) {
                return other;
            }
        }
        return null;
    }

    // ...
}

2. Don’t use the “else” keyword

The argument is to reduce the number of branches, keeping long conditional blocks out of the code.

This seems less justified than the other rules, but one way I have found it beneficial is to use the ternary operator instead of if-else, which has two results: very simple statements, and concise methods. Of course, this only works if the return types are compatible.

Modifying the code above (and supposing that moveEntity and collide returned values rather than void, so that the types are compatible), the result is:

    return other == null ? moveEntity(entity) : collide(entity, other);

I find that much easier to read, and it avoids the vertical code sprawl.

My spin on this rule is to avoid the “else if” sequence, that is, the long switches as seen below, which can oftentimes be refactored to be more polymorphic.

    if (x) {
        ....
    } else if (y) {
        ....
    } else if (z) {
        ....
    }
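As a sketch of that refactoring (the Notifier names here are my own illustration, not from the original paper), each branch of the chain becomes a subclass, and the dispatch becomes a single polymorphic call:

```java
// Hypothetical example: the x/y/z branches become implementations of one
// interface, so client code no longer needs the else-if chain at all.
interface Notifier {
    String deliver(String message);
}

class EmailNotifier implements Notifier {
    public String deliver(String message) {
        return "email: " + message;
    }
}

class SmsNotifier implements Notifier {
    public String deliver(String message) {
        return "sms: " + message;
    }
}
```

The client simply holds a Notifier and calls deliver(message); choosing the branch happens once, where the object is created.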

3. Wrap all primitives and strings

This I found difficult to get used to, but now I’m a vehement advocate, for a reason not cited in the original paper.

The reason he states is type-checking and conveying information to other programmers about what the variable actually is.

If you have type checking, use it. A (the?) strength of Java is the compiler (OK, the JVM too), which can find our silly errors, such as:

    void launchMissile(boolean isNuclear, boolean isTest, Coordinates target);

    // run a non-nuclear missile test:
    launchMissile(false, target, false);

It’ll give us a nice helpful friendly message, such as:

Norad.java:13: launchMissile(boolean,boolean,mil.norad.Coordinates) in mil.norad.Norad cannot be applied to (boolean,mil.norad.Coordinates,boolean)

However, the compiler can’t help us beyond that. Such as:

    // run a non-nuclear missile test:
    launchMissile(true, false, target);

It might seem like overkill (pun intended) to wrap the boolean (or use enums), but look at the result:

    void launchMissile(MissileType missileType, TestStatus testStatus, Coordinates target);

    // run a non-nuclear missile test:
    launchMissile(MissileType.NON_NUCLEAR, TestStatus.IS_TEST, target);

And, going back to Ruby, because Ruby (unlike Java) allows subclassing of all built-in types, it’s easy to simply derive a new type, and eventually to expose the necessary methods of the superclasses as delegators in the new class, and then to un-derive the class from the primitive type.

The original paper highlights the benefit of wrapping primitive types as conveying meaning (to the compiler and to the programmer), but I think there’s a more important reason, and benefit, for avoiding primitive types: it is nearly certain that the behavior of the primitive type will not match how the variable is used.

For example, in a given body of code, one might write this:

    String userName;

That is saying that a user name is a string. Okay, let’s see if that’s right:

  • Strings can have zero length. User names cannot.
  • A String can be 2,147,483,647 characters long. User names cannot.
  • Strings can contain spaces. User names cannot.
  • … etc. …

Another problem with this is that Strings have functionality (methods) that makes no sense for a user name. lastIndexOf? replaceAll? split? java.lang.String has 69 constructors and instance methods, and a user name would use only a few of those.

Thus the declaration “String userName” is quite incorrect (as in “lying”). One might say that “a user name is a string, but with certain restrictions, and with only a subset of the functionality of java.lang.String, and with additional behavior and associations.” That is, there (at least conceptually, if not in the code) is a class UserName, which is approximately, but hardly exactly, a string. And the more approximate that equivalence is, the greater the burden on the programmer and on other code to perform the mapping.
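A minimal sketch of such a wrapper (the particular restrictions here are my own illustrative assumptions, not from the original):

```java
// Hypothetical UserName wrapper: enforces restrictions that a raw String
// cannot, and exposes only behavior that makes sense for a user name.
final class UserName {
    private final String name;

    public UserName(String name) {
        // Illustrative constraints; the real rules would come from the application.
        if (name == null || name.isEmpty() || name.contains(" ") || name.length() > 32) {
            throw new IllegalArgumentException("invalid user name: " + name);
        }
        this.name = name;
    }

    public String asString() {
        return name;
    }
}
```

Now lastIndexOf, split, and the rest of java.lang.String are simply unavailable on a user name, and an invalid name cannot be constructed at all.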

4. First class collections

From the original: "any class that contains a collection should contain no other member variables." Said differently, that means that a collection should be wrapped in a class specific to how the collection is used.

Thus the logic for using a collection goes into the wrapper class, instead of in the "client" code. This feels more like working with an object instead of in it.

For example, a set of scores (mapped from a name to a list of scores) could be implemented as a simple class, with the average and total methods added:

class ScoreList
  def initialize
    @scores = Array.new
  end

  def << score
    @scores << score
  end
    
  def [] index
    @scores[index]
  end

  def total
    @scores.inject(0) { |sum, n| sum + n }
  end

  def size
    @scores.size
  end

  def to_s
    @scores.to_s
  end
end

class Scores
  def initialize
    @scores = Hash.new { |h, k| h[k] = ScoreList.new }
  end

  def add name, score
    @scores[name] << score
  end

  def scores name = nil
    name ? @scores[name] : @scores.values
  end

  def average
    @scores.empty? ? nil : total / count
  end

  def total
    sum_of_scores { |list| list.total }
  end

  def count
    sum_of_scores { |list| list.size }
  end

  def to_s
    @scores.inspect
  end

  def sum_of_scores &blk
    scores.inject(0) { |sum, list| sum + blk.call(list) }
  end
end

As a method parameter in a type-checked language, a first-class collection self-documents the method, and keeps the parameter types shorter than the equivalent Java generics:

    public void process(Map<String, List<Integer>> scores);

    public void process(Scores scores);

And with the earlier example, the List could be changed to an Entities class, containing the logic for a set of entities.
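A minimal sketch of such an Entities class (with a deliberately simplified Entity whose location is a single int, an assumption for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified Entity for this sketch: its location is a single int.
class Entity {
    private final int location;

    Entity(int location) {
        this.location = location;
    }

    int location() {
        return location;
    }
}

// First-class collection: the collision search lives in the collection class,
// not in client code.
class Entities {
    private final List<Entity> entities = new ArrayList<Entity>();

    void add(Entity entity) {
        entities.add(entity);
    }

    Entity getCollisionEntity(Entity entity) {
        for (Entity other : entities) {
            if (entity != other && entity.location() == other.location()) {
                return other;
            }
        }
        return null;
    }
}
```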

For example, my original draft of this post had a single Scores class, which was a Hash of Strings to Arrays of integers. When I reviewed the post, I realized that that version was exposing the collection type Array, so I rewrote it with ScoreList wrapping the Array class, and Scores wrapping the Hash class.

As with primitives, another reason not to use collections directly is that because they are general purpose, they contain methods that very likely don’t apply to how you are using the collection. For example, a Java method returning ArrayList implies that all of the methods of ArrayList are valid, including methods to delete elements from the list.

This rule revolutionized my perspective, and is one that I wish I’d learned and begun practicing years ago. All code I’ve written since applying this is much cleaner and feels much more object-oriented than the legacy practice. Now it’s odd to work directly with a core collection. Maybe Java was right in putting their collections a bit out of the way, in java.util instead of java.lang, to discourage them from being used directly.

5. One dot per line

This is similar to the rule about the “depth” of method blocks, in essence keeping a single level of abstraction.

Applying this to the Ruby code above (Ruby tends to be “horizontal”, with methods chained) eliminates the @scores.values.flatten statements:

class Scores
  def average
    @scores.empty? ? nil : total / all_scores.length
  end

  def total
    all_scores.inject(0) { |sum, n| sum + n }
  end

  def all_scores
    scores.flatten
  end
end

It also tends to make one push the car.getEngine().setSpeed(55) chain into the appropriate classes, letting them do the necessary delegating.

Again, this fixes the problem of dealing with too many levels of abstraction at once, and of working through objects instead of on them.
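For instance, a sketch of that delegation (the Car and Engine classes here are hypothetical):

```java
// Hypothetical delegation: Car forwards setSpeed to its Engine, so a caller
// writes car.setSpeed(55) instead of car.getEngine().setSpeed(55).
class Engine {
    private int speed;

    void setSpeed(int speed) {
        this.speed = speed;
    }

    int speed() {
        return speed;
    }
}

class Car {
    private final Engine engine = new Engine();

    void setSpeed(int speed) {
        engine.setSpeed(speed); // one dot for the caller; Car does the delegating
    }

    int speed() {
        return engine.speed();
    }
}
```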

6. Don’t abbreviate

I just saw a violation of this today: a method named “getPWD()”, to return a password. As a long-time Unix user, I read that as “print working directory”, a problem common with abbreviating: the reader has to understand the context, and switch their mind-set appropriately.

Another problem with abbreviations is that they can become eclipsed. A piece of legacy code that I maintain is named “IM” (for interface module), obviously written prior to instant messaging, which claimed that abbreviation. (It is ironic that “eclipse” is a word that has — at least to many programmers — had its original definition eclipsed, thanks to the IDE.)

I also agree with the point that methods with long names tend to be misplaced. A common occurrence of that is when nouns are in the message name itself, such as:

    sendMessageToUser(message, user);

which should instead be

    message.sendTo(user);

In general, this rule reflects classic object-oriented design: classes should be nouns, and methods should be verbs.

The corollary for the no-nouns-in-methods subrule is that classes should not contain verbs (or nounified verbs), such as “UserManager” and “MessageSender”.

7. Keep all entities small

This rule seems extreme: no classes more than 50 lines, and no packages with more than 10 files.

I definitely like the concept, that large classes tend to become god classes, doing much more work than they should, instead of that behavior being pushed to the acted-upon classes. And I agree that long classes are difficult to read. Notice how IDEs collapse methods? That’s a sign of “fixing” a problem by hiding it.

My corollary to this is that classes should be some minimal length as well. If a class in a verbose language (Java, C++) is too short (10-20 lines or so), it is usually a C-style struct pretending to be a class (just with getX() instead of a public x). Thus it usually has little behavior defined within that class, putting the onus on client code to do the processing.

However, if this rule is applied rigorously, it could make the classes from the core libraries too simple. A string class certainly needs to have a full suite of behavior to be useful. java.lang.String has 69 methods, and the Ruby String class has 131 methods of its own, with another 45 inherited. A guideline is that code should be more concise and focused the closer it gets to the end user.

That said, I look at my projects with a simple metric: the number of lines per class. There should be a relatively narrow range among the line count, deviating little from the average. That is, a nice, narrow bell curve.

A bad sign — and one seen all too often — is a project with 75% of its files containing fewer than 50 lines of code, and 5% with more than 500 lines. Again, that indicates god classes.

Another problem with large classes is that they are simply too complex to test adequately, whereas small classes, with relatively self-contained behavior, are easy to test. And the simple law of economics applies: the lower the cost, the more the consumption. Thus the cheaper it is to write a test, the more that will be written.

8. No classes with more than two instance variables

This is probably the most bizarre rule of all of these, primarily because it goes against standard practice. However, if one applies the above rules, specifically about not using primitives and collections directly, it is much more feasible.

For example, a game could have a unit with the location and movement as:

public class Unit {
    private int x;
    private int y;
    private int direction;
    private int speed;
}

Applying this rule, we could define the Location and Velocity classes, replacing the above with:

public class Unit {
    private Location location;
    private Velocity velocity;
}

That’s another sign of good design, having variable names that closely match their types.

And as with the above, having the cohesively-defined Location and Velocity classes makes them easier to test, and makes it easier to add behavior to them, such as range validation: for example, that a location has to be within the coordinates of a rectangle (such as the screen).

9. No getters/setters/properties

An object should be “set” via its constructor, and low-level changes (meaning to its instance variables) thenceforth should be within the object itself.

Taking the example in rule 8, that would mean that Location could have a move(Velocity) method.
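A sketch of what that might look like (the dx/dy representation of Velocity is my own assumption):

```java
// Hypothetical sketch: Location is told to move; no setters expose its fields.
class Velocity {
    final int dx;
    final int dy;

    Velocity(int dx, int dy) {
        this.dx = dx;
        this.dy = dy;
    }
}

class Location {
    private int x;
    private int y;

    Location(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // The object changes its own state in response to a message.
    void move(Velocity velocity) {
        x += velocity.dx;
        y += velocity.dy;
    }

    boolean isAt(int x, int y) {
        return this.x == x && this.y == y;
    }
}
```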

The principle here is that instead of directly manipulating an object (again, working in the object) the object should be “told” what to do, in the spirit of the original concepts of object-oriented programming, sending messages to an object.

The benefit of this rule is that it encourages thinking in terms of what the object does, instead of what the object has, that is, the “verbs” of the object instead of its “nouns”.

I highly encourage everyone to read the original, which is excellent in explaining the mapping from object-oriented principles to the programming practices.

If you liked this post — and who wouldn’t? — you might also like my series of posts about rewriting a class, using many of these rules.

Refactoring Exception Handling in Java

It seems that database and storage code are especially prone to be mere delegators, such as:

class DatabaseConnection {
    void write(String table, String key, Object value) {
        database.write(table, key, value);
    }
}

That wouldn’t be so painful if not for the number of exceptions in this type of code, and that the exceptions are often converted from one type (such as “DatabaseException” or “NetworkException”) to a different one (such as the more generic “StorageException”).

That leads to code much like in the following scenario, where we have a more generic Storage class that delegates to a database (via a DatabaseConnection). The idea of the Storage class is that the storage doesn’t need to be a database; it could instead be a filesystem or memory. As described above, the Storage class throws the more generic StorageException, as well as PermissionException, which the DatabaseConnection also throws.

The database connection has the following (partial, obviously) definition:

class DatabaseConnection {
    public void connect(User user) throws DatabaseException, NetworkException, PermissionException {
    }

    public void write(String table, String key, Object value) throws DatabaseException, NetworkException, PermissionException {
    }

    public Object read(String table, String key) throws DatabaseException, NetworkException, PermissionException {
        return null;
    }

    public void disconnect() throws DatabaseException, NetworkException, PermissionException {
    }
}    

And our exceptions and the minimal User class are:

class DatabaseException extends Exception {
}

class NetworkException extends Exception {
}

class PermissionException extends Exception {
}

class User {
}

class StorageException extends Exception {
    public StorageException(Exception cause) {
        super(cause);
    }
}

The first implementation of the Storage class is a version so common that it seems there must be a name for this type of write-a-delegate-by-copying-and-pasting (WADBCAP?).

class Storage {
    private final DatabaseConnection dbc;

    public Storage(User user) throws StorageException, PermissionException {
        try {
            dbc = new DatabaseConnection();
            dbc.connect(user);
        }
        catch (DatabaseException de) {
            throw new StorageException(de);
        }
        catch (NetworkException ne) {
            throw new StorageException(ne);
        }
    }

    public void write(String table, String key, Object value) throws StorageException, PermissionException {
        try {
            dbc.write(table, key, value);
        }
        catch (DatabaseException de) {
            throw new StorageException(de);
        }
        catch (NetworkException ne) {
            throw new StorageException(ne);
        }
    }

    public Object read(String table, String key) throws StorageException, PermissionException {
        try {
            return dbc.read(table, key);
        }
        catch (DatabaseException de) {
            throw new StorageException(de);
        }
        catch (NetworkException ne) {
            throw new StorageException(ne);
        }
    }

    public void disconnect() throws StorageException, PermissionException {
        try {
            dbc.disconnect();
        }
        catch (DatabaseException de) {
            throw new StorageException(de);
        }
        catch (NetworkException ne) {
            throw new StorageException(ne);
        }
    }
}    

The redundancy there is overwhelming, so let’s try to reduce it. From a Ruby perspective, it is obvious that using closures would fix this, but we’re in a version of Java (1.6) that doesn’t have closures, and a quick reading on the subject suggests that combining closures and checked exceptions will not be for the weak or fainthearted.

A closure is really just an anonymous (unnamed) method, which is impossible in Java, but there is a close approximation: an anonymous inner class. For the above code, we want some code that will delegate to our quasi-closure, but handle the exceptions for us. Ergo, something like the following, where subclasses will implement the execute method:

abstract class DatabaseStorageAction {
    public abstract Object execute() throws DatabaseException, NetworkException, PermissionException;

    public Object run() throws StorageException, PermissionException {
        try {
            return execute();
        }
        catch (DatabaseException de) {
            throw new StorageException(de);
        }
        catch (NetworkException ne) {
            throw new StorageException(ne);
        }
    }
}

An improved version of the above class would handle generic types, so that types other than Object could be expected from the execute method.
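One way to sketch that generic version, with a typed result from execute so no cast is needed at the call site. The exception classes here are minimal stand-ins for the article's, not the real ones:

```java
// Assumed, minimal stand-ins for the exception types from the article:
class DatabaseException extends Exception {}
class NetworkException extends Exception {}
class PermissionException extends Exception {}
class StorageException extends Exception {
    public StorageException(Throwable cause) { super(cause); }
}

// The generic quasi-closure: execute() returns T instead of Object.
abstract class DatabaseStorageAction<T> {
    public abstract T execute() throws DatabaseException, NetworkException, PermissionException;

    public T run() throws StorageException, PermissionException {
        try {
            return execute();
        }
        catch (DatabaseException de) {
            throw new StorageException(de);
        }
        catch (NetworkException ne) {
            throw new StorageException(ne);
        }
    }
}

public class GenericActionDemo {
    public static void main(String[] args) throws StorageException, PermissionException {
        // A read action can now return String directly, with no cast:
        DatabaseStorageAction<String> read = new DatabaseStorageAction<String>() {
            public String execute() { return "stored-value"; }
        };
        String value = read.run();
        System.out.println(value);
    }
}
```

Void-returning actions would still use `DatabaseStorageAction<Void>` and `return null`, as in the constructor and write cases.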

But it’s good enough for now, so let’s write a new version of the Storage class:

class StoragePartTwo {
    private DatabaseConnection dbc;

    public StoragePartTwo(final User user) throws StorageException, PermissionException {
        DatabaseStorageAction dsa = new DatabaseStorageAction() {
                public Object execute() throws DatabaseException, NetworkException, PermissionException {
                    dbc = new DatabaseConnection();
                    dbc.connect(user);
                    return null;
                }
            };
        dsa.run();
    }

    public void write(final String table, final String key, final Object value) throws StorageException, PermissionException {
        DatabaseStorageAction dsa = new DatabaseStorageAction() {
                public Object execute() throws DatabaseException, NetworkException, PermissionException {
                    dbc.write(table, key, value);
                    return null;
                }
            };
        dsa.run();
    }

    public Object read(final String table, final String key) throws StorageException, PermissionException {
        DatabaseStorageAction dsa = new DatabaseStorageAction() {
                public Object execute() throws DatabaseException, NetworkException, PermissionException {
                    return dbc.read(table, key);
                }
            };
        return dsa.run();
    }

    public void disconnect() throws StorageException, PermissionException {
        DatabaseStorageAction dsa = new DatabaseStorageAction() {
                public Object execute() throws DatabaseException, NetworkException, PermissionException {
                    dbc.disconnect();
                    return null;
                }
            };
        dsa.run();
    }
}    

It is arguable which is better in the above example, but as the number of exception types grows, and the pre- and post-closure code becomes more complex, the latter version is more justified.

A Day Refactoring Code, Part 4: Performance

This post is part of the series about refactoring code, and this post follows part 3.

With the functionality set, let’s focus on performance, since this class is heavily used, being at the core of DoctorJ. The main function in that regard is checkCurrentWord(), which we can immediately see uses a StringBuffer instead of the more performant StringBuilder.

But before we start doing any performance tuning, let’s get a rough estimate of how the current performance is. Running just TestParsingSpellChecker takes 14.735 seconds, and all 300+ tests take 27.098 seconds.

Here’s the outline of the method, in terms of how a word is extracted for checking:

    protected void checkCurrentWord() {
        StringBuffer word = new StringBuffer();
        word.append(currentChar());
        
        int startingPosition = this.pstr.getPosition();
        boolean canCheck = true;

        this.pstr.advancePosition();

        while (hasMore()) {
            char ch = currentChar();
            ... check by character ...
        }

        if (canCheck && this.pstr.getPosition() - startingPosition > 1) {
            checkWord(word.toString(), startingPosition);
        }
    }

Replacing StringBuffer with StringBuilder reduces the time to 11.457 seconds for the single test, although the time for all tests doesn’t change.
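The swap is mechanical, since StringBuilder (added in Java 5) mirrors the StringBuffer API without its synchronization, so each call skips acquiring the object's monitor — which adds up in a hot loop like checkCurrentWord(). A trivial illustration:

```java
public class AppendDemo {
    public static void main(String[] args) {
        // Same append/toString API as StringBuffer, minus the per-call locking:
        StringBuilder word = new StringBuilder();
        for (char ch : "lexeme".toCharArray()) {
            word.append(ch);
        }
        System.out.println(word);
    }
}
```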

The most obvious problem with this code is that we are violating DRY with how the substring is extracted. The starting position is redundant with the word being built, so we can just remove the word buffer, and store the substring by its starting position, extracting it from pstr at the end of processing. (Note that java.lang.String is optimized for substring, with how it uses a char[] and offset).

We’ll need a new method in PString to get a substring at a position other than the current one, and that can be quickly coded and tested:

    /**
     * Returns a substring from the given position, of <code>num</code>
     * characters.
     */
    public String substring(int position, int num) {
        return this.str.substring(position, this.position + num);
    }

See the problem there? We have clashing variables, this.position and position. I’ve changed back and forth from _var to this.var, and am undecided.
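For illustration, a fixed version might rename the parameter so the substring is computed independently of the current-position field. This is a minimal stand-in class, not the real PString:

```java
public class SubstringDemo {
    private final String str;

    public SubstringDemo(String str) {
        this.str = str;
    }

    /**
     * Returns a substring of num characters from startPos, independent of
     * any current-position field, avoiding the variable clash.
     */
    public String substring(int startPos, int num) {
        return this.str.substring(startPos, startPos + num);
    }

    public static void main(String[] args) {
        // characters at indices 3..6 of "tenletters":
        System.out.println(new SubstringDemo("tenletters").substring(3, 4));
    }
}
```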

Back to TestPSC, the new method gets significantly cleaned up. We have two integers, startingPosition and nChars, for when and if the substring needs to be extracted. The test now runs in 8.821 seconds.

Okay, those numbers are misleading, since that is just the Gradle execution time, which includes startup, compilation, etc. And from looking at the unit tests, there are no performance tests — the unit tests are mostly simple ones, consisting of one or just a few lines of input data for the spell checker.

Let’s write a unit test, and get some data for it. We’ll use our own dictionary as input, getting a subset of its lines, and scrambling a few to be misspelled words:

    ~doctorj% perl -lane 'next if rand() < 0.95; $_ = reverse($_) if rand() < 0.2; print' etc/words.en_US > src/test/resources/wordlist.txt

Note that that goes into src/test/resources, which gets added to the build subdirectory by Gradle, and thus is in the classpath, to be read from our test code (with a couple of IJDK methods):

    public void testPerformance() {
        InputStream in = getClass().getClassLoader().getResourceAsStream("wordlist.txt");
        List<String> lines = InputStreamExt.readLines(in);

        // wrap as a Javadoc comment block:
        lines.add(0, "/** \n");
        lines.add("*/\n");

        String comment = StringExt.join(lines, "\n");
    }

First, let’s look at what the performance was with the old version of PSC, from 5 revisions back:

~doctorj% git checkout master~5 src/main/java/org/incava/text/spell/ParsingSpellChecker.java

With our test:

    public void testPerformance() {
        InputStream in = getClass().getClassLoader().getResourceAsStream("wordlist.txt");
        List<String> lines = InputStreamExt.readLines(in);

        // wrap as a Javadoc comment block:
        lines.add(0, "/** \n");
        lines.add("*/\n");

        String comment = StringExt.join(lines, "\n");

        long start = System.currentTimeMillis();
        
        tcsc.check(comment);

        long end = System.currentTimeMillis();

        long duration = end - start;
        System.err.println("# words: " + lines.size() + "; duration: " + duration);
    }

Alas, there’s nearly no difference:

previous code: # words: 7377; duration: 71273
new code: # words: 7377; duration: 69576

But it’s a slight improvement, and the code is much simpler. We also have a new class, PString, for usage in other scenarios in which a string is quasi-parsed.

Finally, we’ll disable our performance tests from the unit tests, check everything in, and look for the next thing to fix.

A Day Refactoring Code, Part 3: Naming

“Naming is the essence of programming.” -Fred Brooks, The Mythical Man-Month

Continuing from part 2 of this series, let’s now revisit our names, now that we have a better understanding of the behavior here.

PosString hasn’t grown on me as a name, and taking a bit of inspiration from the GString in Groovy, let’s rename it PString, which suggests “Processable String” and “Positionable String”. We’ll also improve the field names “pos” and “len” by renaming them as “position” and “length”. “str” is also quite short, but is a common name for a string, whereas “pos” and “len” are less common.

Going back to ParsingSpellChecker (which I’ll refer to as PSC), PString now has all of its low-level string processing functionality. There are methods in ParsingSpellChecker that seem more specific to its domain, such as skipLink().

The edge condition here would be a method such as skipThroughWord(), which advances to the next character that is not a letter or digit. As the descriptions of the methods get longer, they are more likely to be too specific for usage in a general class.
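For illustration, a minimal, self-contained sketch of what such a skipThroughWord() might look like (a stand-in class, not the real ParsingSpellChecker):

```java
public class SkipThroughWordDemo {
    private final String str;
    private int position;

    public SkipThroughWordDemo(String str) {
        this.str = str;
        this.position = 0;
    }

    // Advance to the next character that is not a letter or digit:
    public void skipThroughWord() {
        while (position < str.length() && Character.isLetterOrDigit(str.charAt(position))) {
            ++position;
        }
    }

    public static void main(String[] args) {
        SkipThroughWordDemo demo = new SkipThroughWordDemo("word1 next");
        demo.skipThroughWord();
        System.out.println(demo.position); // stops at the space after "word1"
    }
}
```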

So that’s it for PString. Let’s start migrating ParsingSpellChecker to use it. First, we’ll keep the instance fields public, and just switch the references:

            this.pstr = new PString(str);
            this.str = pstr.str;
            this.len = pstr.length;

Through the code we replace “this.pos” with “this.pstr.position”, so again the low-level behavior is the same, just with ParsingSpellChecker being highly coupled to PString, for now.

We’ll do likewise with the len/length fields, so now our PSC methods have code such as:

    protected void consumeTo(String what) {
        while (this.pstr.position < this.pstr.length && this.pstr.position + what.length() < this.pstr.length && !this.str.substring(this.pstr.position).startsWith(what)) {
            ++this.pstr.position;
        }
    }

We’re midstream on this task, but given our status (shown by running an alias for “git diff --stat HEAD”) we see that we have 52 changed lines, which is a reasonable amount of code for a checkin:

~doctorj% gitdfs
 .../org/incava/text/spell/ParsingSpellChecker.java |   52 +++++++++-----------
 1 files changed, 23 insertions(+), 29 deletions(-)

We check it in and get back to work:

~doctorj% git commit -m "Began transition for ParsingSpellChecker to use PString." .
[master 95db8c1] Began transition for ParsingSpellChecker to use PString.
 1 files changed, 23 insertions(+), 29 deletions(-)

We remove this.str references, replacing them with “this.pstr.str”, then begin “flipping” the methods in PSC to call their PString equivalents, such as this change (diff) for currentChar():

     protected Character currentChar() {
-        return this.str.charAt(this.pstr.position);
+        return this.pstr.currentChar();
     }

We make the transition a bit easier by renaming “consume()” to “consumeFrom()”, matching our PString method advanceFrom().

We go through the PSC code, replacing usage of PString fields with accessors and the equivalent mutators, such as in skipBlanks():

    // before

    protected void skipBlanks() {
        while (this.pstr.position + 2 < this.pstr.length && currentChar() != '<' && !this.str.substring(this.pstr.position, this.pstr.position + 2).equals("{@") && !Character.isLetterOrDigit(currentChar())) {
            ++this.pstr.position;
        }
    }

    // after

    protected void skipBlanks() {
        while (this.pstr.hasChar() && !this.pstr.isMatch("<") && !this.pstr.isMatch("{@") && !Character.isLetterOrDigit(currentChar())) {
            this.pstr.advancePosition();
        }
    }

That’s still the worst offender in terms of “width”, the number of characters wide a line is, which is a good way to find candidate code to be rewritten “skinnier”. (Other ways to find candidate code: by number of lines, and by depth of nested blocks, a variant of width).

In the next part of this series we’ll wrap it up with a look at performance and code quality.

A Day Refactoring Code, Part 2: Unit Testing

Continuing from part 1, we’ll begin writing test code from low-level functionality to higher-level, essentially paralleling the class as it grows in functionality as well.

It might seem like overkill to test very low-level code (such as getX()), but without low-level tests, a programmer may find himself analyzing the code at the wrong level, mistaking fundamental problems for complex errors.

Not testing low-level functionality led to a problem when I was prototyping a Javadoc parser (for DoctorJ). I wrote the original prototype (redundancy alert) in Ruby, then began migrating it to JRuby (intending to link it with Java code).

When I switched from Ruby to JRuby, there were failed unit tests, and I assumed that the problem was in my code, not in JRuby. (Yes, I adhere to the principle that “select ain’t broken”.)

Eventually, after much work, I discovered that a bug in the Regexp.last_match functionality in JRuby meant that the very high-level parsing functionality unit tests failed, and it took a lot of effort to drill down through the parsing code to discover that the issue was with the Regexp engine itself. I was thankful for the opportunity to contribute a fix for the issue, which was released with JRuby 1.6.5.

If I had had simpler tests, I probably would have found the issue sooner.

Back to unit testing in general, a good rule is to write tests in the order of the number of instance variables they test, either directly or indirectly. Thus the simplest tests would invoke methods that use only one instance variable (such as getX and setX). Subsequent tests should accrete in complexity, testing more instance variables.

This is essentially taking the concept from integration tests (multiple components) within unit testing itself, seeing variables (and sequences of methods and usage of multiple classes) as components as well. Thus unit testing begins with the smallest *unit* (variables), and expands to the point of being in the gray area where integration tests begin.

Back to this example, the next method to test is the one returning the character at the current position. Again we’ll write a custom assertion, then the test itself.

Implementing the tests such that it advances the position beyond the end of the string results in an exception:

        // out of range:
        str0.advancePosition(12);
        assertCurrentChar(null, str0);
java.lang.StringIndexOutOfBoundsException: String index out of range: 15
	at java.lang.String.charAt(String.java:694)
	at org.incava.text.PosString.currentChar(PosString.java:75)
	at org.incava.text.TestPosString.assertCurrentChar(TestPosString.java:36)
	at org.incava.text.TestPosString.testCurrentChar(TestPosString.java:84)

That shows that currentChar() should be updated to test that it is within bounds:

from:

        return this.str.charAt(this.pos);

to:

        return this.pos < this.len ? this.str.charAt(this.pos) : null;

The next test is for the hasMore method. What is important here is that we are testing the edge condition, for when pos is 1 less than len, is equal to len, and is 1 greater than it.

        str0.advancePosition();
        assertHasMore(true, str0);

        str0.advancePosition(12);
        assertHasMore(true, str0);

        str0.advancePosition();
        assertHasMore(false, str0);

        str0.advancePosition();
        assertHasMore(false, str0);

This seems to work … except that we didn't actually verify the edge condition by hand. Why not? Laziness: we just assumed that the string is 14 characters long, with its last character at position 13.

Simplicity

This highlights another tenet of good unit testing: simplicity. The string we tested is so long ("this is a test" is my default string for string/regexp testing) that we didn't do the test "manually".

We've been rewriting the code for a while now, so we can consider checking this into our VCS, or we can just keep going.

When I'm on a multi-person project, I (nearly) never check in code that will fail the unit tests. In contrast, in my local environment, I find it easier to restart work after a break (especially when starting in the morning) with the project in the state of a failed unit test. This immediately gets me to the latest point of progress, and focuses my attention on the specific problem I was last working on. Buffer lists just aren't enough, since they only show me the latest files I was working on, but not exactly what I was doing.

So we check in the code, then get back to the hasMore method. We made a mistake by, again, starting with overly complicated tests, in this case a string that was too long, so we'll go simple here, adding the field, and amending the method:

    private PosString abc;

    public void testHasMore() {
        assertHasMore(true, abc); // at 'a'
        // no change
        assertHasMore(true, abc);

        abc.advancePosition();  // at 'b'
        assertHasMore(true, abc);

        str0.advancePosition(); // at 'c'
        assertHasMore(true, abc);

        str0.advancePosition(); // off end
        assertHasMore(false, str0);
    }

Did you see the error? That was the first version that I wrote, which led to this failure:

junit.framework.AssertionFailedError: 'this is a test' (2) expected:<false> but was:<true>
	at org.incava.text.TestPosString.assertHasMore(TestPosString.java:44)
	at org.incava.text.TestPosString.testHasMore(TestPosString.java:123)    

‘this is a test’? This error was caused by my not doing a replace-all within the method, instead editing it “by hand”. (I’m convinced that copying and pasting, combined with search and replace, are responsible for 77.1% of software problems.)

We’ll fix our method:

    public void testHasMore() {
        assertHasMore(true, abc); // at 'a'
        // no change
        assertHasMore(true, abc);

        abc.advancePosition();  // at 'b'
        assertHasMore(true, abc);

        abc.advancePosition(); // at 'c'
        assertHasMore(true, abc);

        abc.advancePosition(); // off end
        assertHasMore(false, abc);
    }

Okay, that works, but should it? Is there “more” when the position is at the last character? I say no, so we change the hasMore code to call hasNumMore with num == 1:

    public boolean hasNumMore(int num) {
        return this.pos + num < this.len;
    }
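The amended hasMore is then a one-line delegation; here is a minimal, self-contained sketch (a stand-in class, not the real PosString):

```java
public class HasMoreDemo {
    private final int len;
    private int pos;

    public HasMoreDemo(String str) {
        this.len = str.length();
        this.pos = 0;
    }

    public void advancePosition() {
        ++pos;
    }

    public boolean hasNumMore(int num) {
        return this.pos + num < this.len;
    }

    // hasMore now means "has characters beyond the current one":
    public boolean hasMore() {
        return hasNumMore(1);
    }

    public static void main(String[] args) {
        HasMoreDemo abc = new HasMoreDemo("abc");
        System.out.println(abc.hasMore()); // at 'a': 'b' and 'c' remain
        abc.advancePosition();             // at 'b'
        System.out.println(abc.hasMore()); // 'c' remains
        abc.advancePosition();             // at 'c', the last character
        System.out.println(abc.hasMore()); // nothing beyond 'c'
    }
}
```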

We update the test:

        abc.advancePosition(); // at 'c'
        assertHasMore(false, abc);

Now the unit tests run without failure. Note that parameters to methods, and variables in general, should have a length in characters proportionate to their scope. It's overkill to write "indexOfTheArray" in a small for-loop, whereas an instance variable named "td" is inadequately descriptive.

Writing the hasNumMore tests gives us a chance to use the Range class from the IJDK library, making for a bit terser and cleaner syntax:

    public void testHasNumMore() {
        for (int i : new Range(0, 10)) {
            assertHasNumMore(true, tenletters, i);
        }
    }

Here we're using another test string, "tenletters", which happens to be ten letters long. Fun, huh?

This is an excellent example of why custom assertions are valuable. Within a loop (checking a list of values), without a message for the assertEquals(), we would not know the value for which the assertion failed. But in this case, with the custom assertion, we get the output showing the value (10) that failed:

junit.framework.AssertionFailedError: pstr: ''tenletters' (0)'; num: 10 expected:<true> but was:<false>
	at org.incava.text.TestPosString.assertHasNumMore(TestPosString.java:51)
	at org.incava.text.TestPosString.testHasNumMore(TestPosString.java:131)

That actually shows that the test is incorrect: there are only 9 more characters from the initial position in the string. So we change the end of the range from 10 to 9 (that is, the range is 0 through 9), rerun, and the tests pass.

From the old class, we will pull the consume and consumeTo, amending their names and their code, starting with advanceTo, which is derived from the old consumeTo:

    public void advanceTo(String what) {
        while (this.pos < this.len && this.pos + what.length() < this.len && !this.str.substring(this.pos).startsWith(what)) {
            ++this.pos;
        }
    }

    public void testAdvanceTo() {
        assertCurrentChar('a', abc);
        abc.advanceTo("b");
        assertCurrentChar('b', abc);
        abc.advanceTo("c");
        assertCurrentChar('c', abc);
        abc.advanceTo("d");
        assertCurrentChar(null, abc);
    }

That raises another question: what happens when we advance to the end of the string? Should the current character be the last one, or should it be null?

We'll amend it to be null, and clean up the code, whose conditional is too long (and redundant, with its "pos < len && pos + n < len" tests), by adding a new method, isMatch:

        while (this.pos < this.len && !isMatch(what)) {
            this.pos++;
        }

    public boolean isMatch(String what) {
        return this.pos + what.length() <= this.len && this.str.substring(this.pos).startsWith(what);
    }

Bringing over the old consume method gives us a chance to use the new isMatch method, for clearer code:

    protected boolean consume(String what) {
        skipBlanks();
        if (this.pos + what.length() < this.len && this.str.substring(this.pos).startsWith(what)) {
            this.pos += what.length();
            return true;
        }
        else {
            return false;
        }
    }

    public boolean advanceFrom(String what) {
        if (isMatch(what)) {
            advancePosition(what.length());
            return true;
        }
        else {
            return false;
        }
    }

This is in keeping with the principle of having fewer dots per line. More dots indicate a violation of the Law of Demeter, with excessive low-level access to instance variables.

Notice that this method, as we've advanced through the class, is using no instance variables directly. Essentially the class is "coming up" from a low-level implementation closer to the point of how it will be used.

We'll bring over consumeFromTo, renaming it advanceFromTo. Again, we have no instance variables directly accessed:

    public void advanceFromTo(String from, String to) {
        if (advanceFrom(from)) {
            advanceTo(to);
        }
    }

We got all the way to here, and from looking at the code, we have an edge condition issue, as to what hasMore means. Does it mean "has more characters remaining to the end of the string", or does it mean "has a current character", that is, is not at the end?

Following the principle of when in doubt, copy, we'll imitate the behavior of the Java collections, specifically Iterable, for which hasNext() means "has a current element", and next() takes it.

So hasMore() will be amended to mean "has a current character", and will be renamed to hasChar(), which is less ambiguous:

    public boolean hasChar() {
        return this.pos < this.len;
    }

Notice that we slipped isMatch into the code, without adding tests for it, so let's fix that now, moving the tests before those for the advanceFrom/To methods, since isMatch is lower level, that is, is used by advanceFrom and advanceTo.

The implementation of assertIsMatch quotes the string. Again, this can be useful in cases where the string is a whitespace character.

        String msg = "pstr: " + pstr + "; str: '" + str + "'";
        assertEquals(msg, exp, pstr.isMatch(str));
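For reference, a self-contained sketch of the full assertIsMatch, using a minimal PString stand-in and a plain AssertionError in place of JUnit's assertEquals:

```java
public class AssertIsMatchDemo {
    // Minimal stand-in for PString, with the isMatch from the article:
    static class PString {
        final String str;
        int position;

        PString(String str) { this.str = str; }

        boolean isMatch(String what) {
            return position + what.length() <= str.length()
                && str.substring(position).startsWith(what);
        }

        public String toString() { return "'" + str + "' (" + position + ")"; }
    }

    // The custom assertion, quoting str so whitespace arguments are visible:
    static void assertIsMatch(boolean exp, PString pstr, String str) {
        String msg = "pstr: " + pstr + "; str: '" + str + "'";
        if (exp != pstr.isMatch(str)) {
            throw new AssertionError(msg);
        }
    }

    public static void main(String[] args) {
        PString abc = new PString("abc");
        assertIsMatch(true, abc, "ab");
        assertIsMatch(false, abc, "abcd");
        assertIsMatch(false, abc, " ");  // a quoted blank is visible in msg
        System.out.println("ok");
    }
}
```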

Repetition in Tests

We write the advance* methods, and end up with this one for testAdvanceFromTo():

    public void testAdvanceFromTo() {
        assertCurrentChar('a', abc);
        abc.advanceFromTo("a", "c");
        assertCurrentChar('c', abc);

        assertCurrentChar('t', tenletters);

        // no change
        tenletters.advanceFromTo("e", "n");
        assertCurrentChar('t', tenletters);

        // to first e
        tenletters.advanceFromTo("t", "e");
        assertCurrentChar('e', tenletters);
        assertPosition(1, tenletters);

        // to second e
        tenletters.advanceFromTo("e", "e");
        assertCurrentChar('e', tenletters);
        assertPosition(4, tenletters);

        // to second t
        tenletters.advanceFromTo("e", "t");
        assertCurrentChar('t', tenletters);
        assertPosition(5, tenletters);

        // to third t (next character)
        tenletters.advanceFromTo("t", "t");
        assertCurrentChar('t', tenletters);
        assertPosition(6, tenletters);
    }

By using the higher-level (custom) assertions the code is much cleaner, but there is obvious repetition, so we will extract the common code into a method that executes a subset of behavior and assertions.

    protected void doAdvanceFromToTest(Character expChar, int expPos, PosString pstr, String from, String to) {
        pstr.advanceFromTo(from, to);
        assertCurrentChar(expChar, pstr);
        assertPosition(expPos, pstr);
    }

Our testAdvanceFromTo method is now much cleaner:

    public void testAdvanceFromTo() {
        assertCurrentChar('a', abc);
        doAdvanceFromToTest('c', 2, abc, "a", "c");

        assertCurrentChar('t', tenletters);

        // no change
        doAdvanceFromToTest('t', 0, tenletters, "e", "n");

        // to first e
        doAdvanceFromToTest('e', 1, tenletters, "t", "e");

        // to second e
        doAdvanceFromToTest('e', 4, tenletters, "e", "e");

        // to second t
        doAdvanceFromToTest('t', 5, tenletters, "e", "t");

        // to third t (next character)
        doAdvanceFromToTest('t', 6, tenletters, "t", "t");

        // to end of string
        doAdvanceFromToTest('s', 9, tenletters, "t", "s");

        // off end of string
        doAdvanceFromToTest(null, 10, tenletters, "s", "j");
    }

A simple rule of economics is that as cost decreases, demand (usage/consumption) increases; that rule applied to testing means that as tests become easier (and more fun, frankly) to write, people will write more tests.

Note in the example above, with the low-level functionality and testing put into a separate method, the test code can focus on the edge conditions, such as “what happens if I’m on the same string as the one I want to advance to?” and “what happens when one advances to the end of the string?”

Edge conditions are where code will fail. Beware of thinking that more assertions are better than fewer; quality of assertions is better than sheer mass. (And “Sheer Mass” would be a good name for a rock band.)

In the next post, we’ll discuss naming, “the essence of programming”.

A Day Refactoring Code, Part 1: Rewriting

This post continues the overview of code being refactored.

We’ll pull the str, pos, and len fields and build a new class around them. Naturally, the first step is writing a unit test, beginning by testing the lowest-level functionality: setting, getting, and advancing the position. That low-level functionality is only an intermediate step in getting the class we want, since eventually we’ll have some more expansive behavior.

So for now those methods are public, but we’ll probably eliminate them or reduce their accessibility, thus encouraging usage of higher-level methods instead.

For each method we’ll be testing, we will have a custom assertion, an assertion that essentially wraps the low-level assertEquals with a more expressive message associated with it.

Custom assertions should generally follow the form:

    public void assertMethodName(VType expectedValue, OType objTested, Object ... methodArguments) {
        String msg = "obj: " + objTested + "; arguments: " + java.util.Arrays.toString(methodArguments);
        assertEquals(msg, expectedValue, objTested.methodName(methodArguments));
    }

For example, here is the custom assertion for testing substring, which takes two integer arguments:

    public void assertSubstring(String exp, PString pstr, int pos, int num) {
        String msg = "pstr: " + pstr + "; pos: " + pos + "; num: " + num;
        assertEquals(msg, exp, pstr.substring(pos, num));
    }

This produces the output:

junit.framework.ComparisonFailure: pstr: 'abc' (0); pos: 1; num: 1 expected:<b> but was:<b>
	at junit.framework.Assert.assertEquals(Assert.java:85)
	at org.incava.text.TestPString.assertSubstring(TestPString.java:44)
	at org.incava.text.TestPString.testSubstringWithPosition(TestPString.java:110)

In comparison, unit tests that use the low-level assertEquals, such as the following, result in the subsequent unit test output:

    public void assertSubstring(String exp, PString pstr, int pos, int num) {
        assertEquals(exp, pstr.substring(pos, num));
    }
junit.framework.ComparisonFailure: expected:<b> but was:<b>
	at junit.framework.Assert.assertEquals(Assert.java:85)
	at junit.framework.Assert.assertEquals(Assert.java:91)
	at org.incava.text.TestPString.assertSubstring(TestPString.java:48)
	at org.incava.text.TestPString.testSubstringWithPosition(TestPString.java:114)

In the first example, we know the position (1), number (1), and information about PString itself (its wrapped string is “abc”, and it’s at position 0).

In the next post, we’ll delve more deeply into the unit tests.

A Day Refactoring Code, Part 0: Overview

I’ve been rewriting much of DoctorJ, which has been suffering some code rot after several years of existence, now at major release 5.

While poking through the code, I found the class ParsingSpellChecker, which, as its name implies, spell-checks, in particular working on Javadoc comment blocks, such as skipping <pre> blocks and "{@link" tags:

package org.incava.text.spell;

import java.io.InputStream;
import java.util.*;
import org.incava.ijdk.util.MultiMap;

public class ParsingSpellChecker {
    /**
     * Whether we can spell check, which we can't until we get a word or a
     * dictionary (which is a thing with a lot of words).
     */
    private boolean canCheck;

    /**
     * The spell checker, which doesn't parse.
     */
    private final SpellChecker checker;

    /**
     * The current string we're working on.
     */
    private String str;

    /**
     * The length of the current string.
     */
    private int len;

    /**
     * The current position within the string.
     */
    private int pos;
    
    public ParsingSpellChecker(SpellChecker checker) {
        this(checker, false);
    }

    public ParsingSpellChecker(SpellChecker checker, boolean canCheck) {
        this.checker = checker;
        this.canCheck = canCheck;
    }

    public boolean addDictionary(String dictionary) {
        this.canCheck = this.checker.addDictionary(dictionary) || this.canCheck;
        return this.canCheck;
    }

    public boolean addDictionary(InputStream dictionary) {
        this.canCheck = this.checker.addDictionary(dictionary) || this.canCheck;
        return this.canCheck;
    }

    public void addWord(String word) {
        this.checker.addWord(word);
        this.canCheck = true;
    }

    public void check(String str) {
        if (this.canCheck) {
            this.str = str;
            this.len = this.str.length();
            this.pos = 0;
    
            while (hasMore()) {
                skipToWord();
                if (hasMore()) {
                    if (Character.isLetter(currentChar())) {
                        checkCurrentWord();
                    }
                    else {
                        // not at an alpha character. Might be some screwy formatting or
                        // a nonstandard tag.
                        skipThroughWord();
                    }
                }
            }
        }
    }

    /**
     * Consumes from one string to another.
     */
    protected void consumeFromTo(String from, String to) {
        if (consume(from)) {
            consumeTo(to);
        }
    }

    /**
     * Skips content between pairs of brackets, a la HTML, such as, oh,
     * "<html>foobar</html>." Doesn't handle slash-gt, such as "<html
     * style="lackthereof" />".
     */
    protected void skipBracketedSection(String section) {
        consumeFromTo("<" + section + ">", "</" + section + ">");
    }

    protected void skipLink() {
        consumeFromTo("{@link", "}");
    }
    
    protected void skipBlanks() {
        while (this.pos + 2 < this.len && currentChar() != '<' && !this.str.substring(this.pos, this.pos + 2).equals("{@") && !Character.isLetterOrDigit(currentChar())) {
            ++this.pos;
        }
    }
    
    protected void skipToWord() {
        skipBracketedSection("code");
        skipBracketedSection("pre");
        skipLink();
        consume("&;nbsp;");
    }

    protected void checkWord(String word, int position) {
        MultiMap<Integer, String> nearMatches = new MultiMap<Integer, String>();
        boolean valid = this.checker.checkCorrectness(word, nearMatches);
        if (!valid) {
            wordMisspelled(word, position, nearMatches);
        }
    }

    protected void wordMisspelled(String word, int position, MultiMap<Integer, String> nearMatches) {
        int nPrinted = 0;
        final int printGoal = 15;
        for (int i = 0; nPrinted < printGoal && i < 4; ++i) { // 4 == max edit distance
            Collection<String> matches = nearMatches.get(i);
            if (matches != null) {
                for (String match : matches) {
                    System.out.println("    near match '" + match + "': " + i);
                    ++nPrinted;
                }
            }
        }
    }

    protected boolean consume(String what) {
        skipBlanks();
        if (this.pos + what.length() < this.len && this.str.substring(this.pos).startsWith(what)) {
            this.pos += what.length();
            return true;
        }
        else {
            return false;
        }
    }

    protected void consumeTo(String what) {
        int len = this.str.length();
        while (this.pos < len && this.pos + what.length() < len && !this.str.substring(this.pos).startsWith(what)) {
            ++this.pos;
        }
    }

    protected Character currentChar() {
        return this.str.charAt(this.pos);
    }

    protected boolean hasMore() {
        return this.pos < this.len;
    }

    protected void checkCurrentWord() {
        StringBuffer word = new StringBuffer();
        word.append(currentChar());
        
        int startingPosition = this.pos;
        boolean canCheck = true;

        ++this.pos;

        // spell check words that do not have:
        //     - mixed case (varName)
        //     - embedded punctuation ("wouldn't", "pkg.foo")
        //     - numbers (M16, BR549)
        while (hasMore()) {
            char ch = currentChar();
            if (Character.isWhitespace(ch)) {
                break;
            }
            else if (Character.isLowerCase(ch)) {
                word.append(ch);
                ++this.pos;
            }
            else if (Character.isUpperCase(ch)) {
                skipThroughWord();
                return;
            }
            else if (Character.isDigit(ch)) {
                skipThroughWord();
                return;
            }
            else {
                // must be punctuation, which we can check it if there's nothing
                // but punctuation up to the next space or end of string.
                if (this.pos + 1 == this.len) {
                    // that's OK to check
                    break;
                }
                else {
                    ++this.pos;
                    while (hasMore() && !Character.isWhitespace(currentChar()) && !Character.isLetterOrDigit(currentChar())) {
                        // skipping through punctuation
                        ++this.pos;
                    }
                    if (this.pos == this.len || Character.isWhitespace(currentChar())) {
                        // punctuation ended the word, so we can check this
                        break;
                    }
                    else {
                        // punctuation did NOT end the word, so we can NOT check this
                        skipThroughWord();
                        return;
                    }
                }
            }
        }

        // has to be more than one character:
        if (canCheck && this.pos - startingPosition > 1) {
            checkWord(word.toString(), startingPosition);
        }
    }

    protected void skipThroughWord() {
        ++this.pos;
        while (hasMore() && !Character.isWhitespace(currentChar())) {
            ++this.pos;
        }
    }
}

The problem with ParsingSpellChecker is that its low-level mechanics (walking through a string by position, i.e. the current index, and checking against the string length) don't match the higher-level behavior the class is supposed to provide.

Another red flag in the class is the trio of instance fields (len, pos, and str) that are so tightly coupled that essentially none of them is ever used without the others.
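One way to dissolve that coupling would be to extract the traversal state into an object of its own. The sketch below is purely illustrative (the name StringCursor and its methods are hypothetical, not from the project), but it shows how str, pos, and len could travel together behind a small interface:

```java
// Hypothetical sketch: bundles the string, position, and length into one
// object so callers can no longer manipulate them independently.
public class StringCursor {
    private final String str;
    private int pos;

    public StringCursor(String str) {
        this.str = str;
        this.pos = 0;
    }

    // Replaces the scattered "this.pos < this.len" checks.
    public boolean hasMore() {
        return pos < str.length();
    }

    // Replaces currentChar().
    public char current() {
        return str.charAt(pos);
    }

    // Replaces the many bare "++this.pos" statements.
    public void advance() {
        ++pos;
    }

    public int position() {
        return pos;
    }

    // Replaces the substring(...).startsWith(...) idiom without copying.
    public boolean isMatch(String what) {
        return str.startsWith(what, pos);
    }
}
```

With a class like this, skipThroughWord() collapses into a loop over advance(), and the len field disappears entirely.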

It is also telling that the indentation within the class runs very deep, another sign of poor abstraction. The fewer the indentation levels, the more likely it is that the code is evenly abstracted at the proper level.
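As one small illustration (a sketch of my own, not the rewrite promised below), the outermost level of check() could be flattened with a guard clause: returning early when canCheck is false removes an entire layer of nesting from everything that follows.

```java
// Illustrative only: a stripped-down check() showing the guard-clause form.
public class GuardClauseDemo {
    boolean canCheck;
    String str;
    int len;
    int pos = -1;   // -1 marks "check() never ran"

    GuardClauseDemo(boolean canCheck) {
        this.canCheck = canCheck;
    }

    void check(String str) {
        // Guard clause: bail out early instead of wrapping the whole
        // method body in "if (this.canCheck) { ... }".
        if (!canCheck) {
            return;
        }
        this.str = str;
        this.len = str.length();
        this.pos = 0;
        // ... the word-scanning loop would follow, one level shallower ...
    }
}
```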

Code repetition is another issue: statements such as ++this.pos appear in many locations.

From running PMD, we know that StringBuilder is preferred over StringBuffer for performance reasons. Similarly, by building its own substrings one character at a time (via StringBuffer.append()), the code fails to take advantage of String.substring(), which refers to a range within the original String's underlying char array via offsets rather than copying it.
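The StringBuilder change is mechanical, since the two classes share the same API; the only difference is that StringBuffer synchronizes every call, which buys nothing in this single-threaded parser. A minimal sketch (the helper name collectWord is mine, echoing what checkCurrentWord() does):

```java
public class BuilderDemo {
    // Same append()/toString() API as StringBuffer, minus the per-call locking.
    public static String collectWord(String input) {
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length() && Character.isLowerCase(input.charAt(i)); ++i) {
            word.append(input.charAt(i));
        }
        return word.toString();
    }
}
```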

That’s the overview. In the next post, we’ll begin rewriting this mess.