Glark 1.10.0

Glark 1.10.0 is ready to be released, after a couple of years of being not a high priority for me. I was inspired to rewrite it when I looked through the code, much of it written early in my Ruby days. It began as a Perl script, and retained that scriptitude through its life.

At times while rewriting Glark, I wish that I’d blogged that experience. The short description is that I had a few primary guidelines:

  • Test as thorough as possible, ideally one test (at least) per feature. A feature is essentially the same as an option. Each source file/class should have an equivalent test case.
  • Keep files and classes small, and relatively even in size.
  • Simplify the set of options.
  • Eliminate global variables.
  • Add one feature per day.

Following these principles resulted in a code base I am much more satisfied with.

Previously much, if not most, of Glark was “field-tested”, a euphemism for “I tried it out a while ago, and I think it worked then.” As the test suite grew, the code became much easier to refactor with confidence.

Regarding the size of files and classes, I used a simple metric:

% wc lib/**/*.rb | sort -rn

And then I usually tackled what was at the bottom of the list.

The average file is now 59 lines long, with the largest being 201 lines, and the smallest, 10 lines. In the previous implementation, the smallest file was 102 lines, the largest, 761 lines, and the average, 328 lines.

Option processing was the major chunk of code tangled through the code base, primarily because there was a single Options class, a singleton used essentially everywhere throughout the code. I first split that into the subsets of options, such as those for the input options, for matching, and for output, with their equivalent submodules using only those option objects instead of the global/singleton.

This was further cleaned up by removing the option processing from what I eventually labeled their “specs”, the values that determined the behavior of the submodule. One idea that I had is that eventually those specs could be passed in from outside of Glark itself, for usage by external programs, such as PVN.

Adding one feature per day, which I’ve written about previously, motivated me to do some non-coding things, mainly writing documentation. I’ve realized that much of the distinguishing functionality of Glark hasn’t been well documented, and the man page has now increased from 927 lines to 1126.

It’s been an interesting evolution of the behavior of Glark, as much of its early functionality has been added into grep, such as colorized matches and context around them. Elaborating on what I wrote in the readme/man page:

Glark extends grep by matching complex expressions, such as “and”, “or”, and “xor”. This is useful in a case such as “I’m looking for “foo” and “bar”, within three lines from each other.” It can be infinitely complex, such as, also from the man page: “glark –and=5 Regexp –xor parse –and=3 boundary quote”, meaning: (within 5 lines of each other: (/Regexp/ and (/parse/ xor (within 3 lines of each other: /boundary/ and /quote/))))

Glark handles file, directory, and path arguments, optionally recursing directories to a certain depth, and processing path arguments as a set of files and directories. .svn and .git subdirectories are automatically excluded.

I realize, with a bit of guilt, that that defies the Unix principle of keeping programs small, and with minimal overlapping functionality, since much of that is already done by the “find” command. However, some of this behavior was included in early Glark, and grep itself has the “-r” option for recursing directories, so I wanted to extend that to be more advanced, in part because when running Glark on Windows systems, there is no “find” command.

Binary files are excluded (by default), but can, in the case of compressed or archived files, have their extracted contents be searched.

This rolls into Glark behavior that I’d wanted for a while, mainly for searching Jar files for class names, which I previously did via shell scripting, such as:

for i in *.jar; do jar tf $i | glark --label=$i Exception; done

That now can be done with:

glark --binary-files=list Exception *.jar

Glark can use a per-project configuration file, so different projects can have their own Glark parameters, such as different files to include and exclude for searching, and different colors for pattern highlighting. My goal there is that it can add feedback for when one is working in different projects, such as highlighting matches in Java code differently than in Ruby. Colorizing is still
only on a per-project basis, not on the file type itself, which I’m considering adding, since it might be helpful to distinguish matches in Ant build files from those in Gradle.

I am doing some final testing, and then the Ruby Gem should be available. I am looking for maintainers to repackage Glark as an RPM and a Debian package, although I will probably release unofficial packages for those within a few days as well.

Stati of Projects

I’ve been jumping around from one project to another on an as-needed basis, and here is the current status of each.

RIEL – I’ve been updating the colorizing code, adding the functionality to use extended color codes on ANSI terminals, instead of the default 10.

Glark – In the midst of a major rewrite, for both code purity and functionality. The main changes so far are extensions to the path/file arguments, bringing in some functionality from find. Glark will also (with the imminent integration of RIEL) support extended colors.

PVN – This project was in heavy development until a month ago, when it went into a state of waiting for updates to Glark, since it will be using Glark for its seek (searching) subcommand.

DiffJ – This project I rewrote in JRuby, but that was a little too slow for a command-line application –  even heavily tweaked, I couldn’t get the startup time under two seconds. So I re-rewrote it in Java, and intend to revisit it to add more intelligent code comparisons, such as understanding that “if (a == b) foo();” is the same as “if (a == b) { foo(); }”

IJDK – Mostly dormant at this point. When I’ve rewritten Java projects, I’ve tried to extract the generic code from them and add them to IJDK, but I haven’t been heavily involved with any of my Java projects lately.

Java-Diff – A while ago this was brought up to date to use generics, and was refactored for code clarity. There hasn’t been any reason to update it since then.

DoctorJ – Alas, this is dormant. Some of its warnings have been integrated into the Java compiler, such as mismatched parameter names. It still goes beyond that level of pedanticalness, so it is most suitable in the development of projects where documentation is paramount, such as APIs.

Related posts:

Rewriting Glark

For the PVN project, I’ve wanted to use Glark as a library for matching text (for the ‘seek’ subcommand), but I’d written Glark as a command-line application, and its design reflected that. It also, like so many “field-tested” programs, had very few tests, with the expectation that because
of being heavily used, flaws would easily surface. That’s relatively accurate: I’m fairly sure that I use Glark more than any other program not built into Unix (I even have grep aliased to “glark -g”).

I was once asked in a job interview what my opinion was of my own code. My response was that my code evidently sucks, because I’m always rewriting it. That was definitely true with Glark, being one of my first Ruby programs, migrating it from Perl, and not having touched it in a long time.

The problems with Glark were classics of bad programming: lack of tests, overly complicated code, use of global variables, poor class composition, and excessive coupling.

So I’ve been eager to rewrite Glark, but without tests, a program is much too brittle, so I knew that I’d first have to add tests. I simple can’t enjoy writing code without tests. However, with a comprehensive test suite, rewriting code is bliss. So I first wrote a few tests, then refined my test framework to the point that writing a unit test is as simple as this:

def test_simple
  fname = '/proj/org/incava/glark/test/resources/textfile.txt'
  expected = [
              "    3   -rw-r--r--   1 jpace jpace   45450 2010-12-04 15:24 02-TheMillersTale.txt",
              "   10   -rw-r--r--   1 jpace jpace   64791 2010-12-04 15:24 09-TheClerksTale.txt",
              "   20   -rw-r--r--   1 jpace jpace   49747 2010-12-04 15:24 19-TheMonksTale.txt",
              "   24   -rw-r--r--   1 jpace jpace   21141 2010-12-04 15:24 23-TheManciplesTale.txt",
             ]
  run_app_test expected, [ '--xor', '\b6\d{4}\b', 'TheM.*Tale' ], fname
end

That’s Glark matching 6nnnn ^ TheM*Tale. At this point, grep has added some of what made Glark quite distinct from it — highlighted/colorized matches and context — but Glark’s most unusual (and fun to program) feature is matching of compound expressions, such as:

% glark --and=3 write --or puts print **/*.rb

That is matching “write” within 3 lines of puts or print.

Do you need that very often? Nope. But it does come in handy, such as in the case of “I’m looking for where we are catching an InvalidArgumentException and logging it (within the next 5 lines) as an error:

% glark --and=5 'catch.*InvalidArgumentException' 'Log.error' **/*.java

Speaking of interviews, a friend of mine has a good practice of when he goes to a company to assess their software, he asks to see what they consider their worst code. Often that code is at the core of their project and is the oldest code, written early on by someone who may have left the company, and/or the code has been piled on with more and more complexity that it is difficult to detangle.

In Glark, the worst code of the entire application was the Options class, clocking in at 761 lines long, containing the 42 options in Glark. This class is a Singleton, which is the fancy Design Pattern way of saying Global Variable. (The worst by-product of the Gang of Four was the sanctifying of Singletons as being a good practice.)

Another sign that the Options class was written poorly is an insanely simple metric: wc. That is, running the “wc” command on all files, sorting them numerically, and looking at the largest files. There are the bottom, in all its corpulent glory:

% wc **/*.rb | sort -rn
    4     7   160 lib/glark.rb
  102   654  5213 lib/glark/help.rb
  183   527  4569 lib/glark/input.rb
  248   640  6064 lib/glark/exprfactory.rb
  266   681  6052 lib/glark/output.rb
  297   777  7392 lib/glark/glark.rb
  440  1048  9663 lib/glark/expression.rb
  761  2615 23377 lib/glark/options.rb
 2301  6949 62490 total

The Options class is used throughout Glark, so extracting it was quite challenging. I decomposed the Options class into smaller groups, and it just so happened that there was a design to follow, documented in, of all places, the help (man) page for Glark. That is, because there are so many options, for legibility and organization they are displayed in the man page in the groups “input”, “matching”, “output”, and “debugging/errors”. So I repackaged the Glark options into input, match, output, and info, and also used that as the organization for the modules within Glark. Thus the Glark::File class went into lib/glark/input/file.rb, and Grep::Lines went into lib/glark/output/grep_lines.rb.

I continued to refine the tests to the point that adding new ones was trivial. As the test coverage increased, this had the effect of making it an aberration when I worked on untested code.

On that note, here’s an easy way to test your test. That is to determine if your test really works, break the code that it is testing. (A “return nil if true” at the beginning a method works nicely.) If the test still passes, then it’s incomplete. If the test fails, then should add confidence that test coverage is adequate. This is also where it becomes fun to break code, and to break tests. As with anything, if it’s fun, we’ll do more of it, which is why it is essential to have a test framework that makes tests easy (ergo, fun) to write.

For years I’ve tracked my daily progress by the simple metric of lines of code, but after hearing this suggested on the Ruby Rogues podcast, I’ve begun the practice of adding one user-facing feature per day, such as a new subcommand or option to PVN. My definition of “feature” includes adding and refining documentation, and it also includes removing options, especially if they are confusing, redundant, unused or obsolete.

I track features with a Features.txt in the root directory of each project, of the form (from PVN):

Thu Oct 25 19:22:52 2012

  seek command: added [ -C --no-color ] option.

This keeps me on track by actually recording features that are added. A script I run on all {project}/Features.txt files shows whether I’ve added one for today, and when I’ve missed a day (only one since I started doing this).

My process for adding features feels like an extension of the TDD process, in short:

  • conceive of a feature
  • add it as a test
  • implement it
  • run tests
  • refactor the tests and code …
  • document the feature in the readme and help files
  • add the feature to the features file
  • commit with the feature description as the comment description.

That’s about it for this update on the coarse rewriting of Glark. You can track the progress of it on GitHub (https://github.com/jeugenepace/glark), and if you want to see the code before the rewrite, check out revision 4d10f192f46ec3df34f971f8b40e03f8df0aed27.

Searching Companion Files

This little ditty (there are no large ditties) looks for a file in the same directory as another file, and then searches that file for the string “desc”, using glark. Note that this is in Z shell.

% for i in **/foo.xml; do n=`dirname $i`/bar.txt; glark --label=$n desc $n; done