There’s a lot of old folk wisdom out
there that justifies the use of slow languages and runtimes on the
basis that the impact on performance doesn’t matter. Here’s a
sampling of them:

  • Most applications are I/O bound so
    the performance of the programming language doesn’t matter

  • The database does all the heavy
    lifting, so performance of the application code doesn’t matter

  • Computer time is cheap compared to
    programmer time

Like most folk wisdom, these all have
an element of truth. They are often brought up in discussions about
the merits of strongly-typed compiled languages versus dynamic
languages, and natively compiled languages versus
virtual-machine-based languages versus interpreted languages. I
could write about
any one of them, and many more, but today I’ll address I/O because of
the “work” I’ve been doing on the WideFinder 2 project.

WideFinder 2

The original WideFinder project was, at
least on first inspection, quite clearly I/O bound (well, if you had
an input file smaller than your main memory…). WideFinder 2 is a
little better because it is more computationally complex, but not by
much. The benchmark is really simple: process a big (42 GB) web
server log file and report some basic statistics. In order to
perform the benchmark, the program must parse the file, store
information from each line in several maps, and finally extract some
statistics out of them.

Tim benchmarked sequential block input on the test system at just
shy of 150 MB/s. At that rate the I/O alone required to process the
42 GB file should take about 280 seconds – call it 5 minutes – so if
the benchmark truly is I/O bound then the program shouldn’t take
much more than 5 minutes to run. Consequently, if we are to judge by
Tim’s reference implementation of WF2 in Ruby, then it quite clearly
isn’t I/O bound.

I/O Takes CPU, too

If you look closely at the Bonnie
benchmark results you’ll notice that doing raw block I/O – meaning
just reading files into a buffer – consumes almost 80% of a single
core of the CPU. That means that I/O alone comes pretty close to
being bound by the CPU as well as by the disk. In fact,
experimentation uncovered that placing the system under high load
reduces maximum I/O throughput. In order to achieve maximum
throughput, you actually have to ensure that you keep one core free
to manage the I/O.
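
To make that concrete, here’s a minimal Java sketch of the pattern
(not code from any actual submission; the file name and block size
are placeholders): one thread does nothing but sequential block
reads, handing buffers to workers through a small bounded queue, so
the reads never have to compete with parsing for a core.

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class DedicatedReader {
        // Empty sentinel buffer signals end-of-file to the consumer.
        static final ByteBuffer EOF = ByteBuffer.allocate(0);

        public static void main(String[] args) throws Exception {
            // A small bounded queue: if the workers fall behind, the
            // reader blocks instead of flooding memory with buffers.
            final BlockingQueue<ByteBuffer> queue =
                new ArrayBlockingQueue<ByteBuffer>(4);

            Thread reader = new Thread(new Runnable() {
                public void run() {
                    try {
                        FileChannel in =
                            new FileInputStream("access.log").getChannel();
                        while (true) {
                            ByteBuffer buf = ByteBuffer.allocate(8 * 1024 * 1024);
                            if (in.read(buf) < 0) break; // end of file
                            buf.flip();
                            queue.put(buf); // hand the block to a worker
                        }
                        in.close();
                        queue.put(EOF);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            reader.start();

            // The consumer does the CPU-heavy work on another core.
            ByteBuffer buf;
            while ((buf = queue.take()) != EOF) {
                process(buf);
            }
        }

        static void process(ByteBuffer buf) { /* parsing goes here */ }
    }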

Application I/O is more than I/O

Another key factor is that what most
application programmers think of as I/O involves a lot more than
shuttling bits from disk into memory. For example, Java and other
Unicode-based platforms have to decode the character encoding of the
file into the native character encoding of the runtime. In the case
of the JVM, this not only requires that every byte be processed, but
also frequently doubles the memory consumption of each character and
requires that the data be copied into a separate buffer.
Furthermore, application code usually deals with files line-by-line
in the form of some standard (often immutable) string object,
thereby requiring yet another pass over the data. So when your code
is simply iterating over lines in a file, it isn’t really just doing
I/O. A fair amount of CPU- and memory-bound work is required to
deliver those nice, neat Unicode strings.
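
To see where those extra passes come from, here’s a sketch (again
Java, with a placeholder file name) that spells out the work a
convenience like BufferedReader normally hides:

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;

    public class DecodeCost {
        public static void main(String[] args) throws Exception {
            FileChannel in = new FileInputStream("access.log").getChannel();

            // Pass 1: raw block I/O -- bytes land in a buffer.
            ByteBuffer bytes = ByteBuffer.allocate(1 << 20);
            in.read(bytes);
            bytes.flip();

            // Pass 2: decode every byte into UTF-16 chars. For ASCII
            // input this doubles the footprint (each 1-byte character
            // becomes a 2-byte char) and copies everything into a
            // second buffer.
            CharsetDecoder decoder = Charset.forName("US-ASCII").newDecoder();
            CharBuffer chars = CharBuffer.allocate(1 << 20);
            decoder.decode(bytes, chars, true);
            chars.flip();

            // Pass 3: materializing immutable String objects copies
            // the characters yet again (line splitting would do this
            // once per line).
            String text = chars.toString();

            in.close();
        }
    }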

Large datasets change the rules

Memory management isn’t free, and even
with an industrial-strength garbage collector it isn’t always
simple. A common practice with immutable objects is to have them
share some internal state in order to avoid excess data copying. For
example, when you take a substring of a Java string, the two string
objects share an underlying character array. Oftentimes this saves
memory, and it turns generating a substring from a linear time and
space operation (with respect to string length) into a constant time
and space operation. Unfortunately, if the substring is much
longer-lived than the original string, and especially if it is much
smaller, you end up with something that feels an awful lot like a
memory leak (whether it truly qualifies as a leak is debatable).
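
A contrived Java illustration (this reflects the String
implementation of the day, where substring() shares the parent’s
backing array):

    import java.util.ArrayList;
    import java.util.List;

    public class SubstringLeak {
        public static void main(String[] args) {
            List<String> keys = new ArrayList<String>();
            for (int i = 0; i < 1000000; i++) {
                String line = buildLongLine(i); // pretend: a ~1KB log line
                // O(1) substring: shares line's backing char[], so this
                // 8-char key secretly pins the whole 1KB line. A million
                // of them pin roughly 2GB of character data.
                keys.add(line.substring(0, 8));
                // The fix: force a copy so the long line can be GC'd.
                // keys.add(new String(line.substring(0, 8)));
            }
        }

        static String buildLongLine(int i) {
            StringBuilder sb = new StringBuilder(1024);
            sb.append(i);
            while (sb.length() < 1024) sb.append(' ');
            return sb.toString();
        }
    }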

Even if you’re not carrying around a
lot of excess baggage in the form of substrings, processing
gigabytes of data consumes a lot of memory. In the case of WF2, a
little bit of pretty much every line needs to be stored for the
calculations at the end. The vast majority of my “debugging” time
for WF2 was spent figuring out which JVM parameters would actually
allow the program to finish without slowing to a crawl (pre-tuning,
there were points where it would spend 90% of its time in GC) or
consuming every bit of memory available (at least a few WF2
solutions hog tons of memory).
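
For a sense of what that tuning looks like, the knobs involved are
along these lines (illustrative values and a made-up main class
name, not my actual settings):

    java -Xms3g -Xmx3g \
         -XX:NewSize=512m \
         -XX:+UseParallelOldGC \
         -verbose:gc -XX:+PrintGCDetails \
         WideFinder2 access.log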

All this means that when dealing with
large files, other parts of the system need to do more work, which
takes away time that could be spent on “real” processing or on
I/O (which, as we’ve already seen, can be a CPU hog). Furthermore,
I/O routines and routines commonly used alongside heavy I/O (like
regular expressions) must be very careful about their memory usage.

So is WF2 really I/O bound?

It depends. Go take a good hard look
at the WF2
results page
. If you look at Tim Bray’s results you would
conclude that no, WF2 clearly is not I/O bound. However, if you look
at the top you’ll see some very impressive numbers that indicate that
WF2 indeed is I/O bound (side note: I really need to take a hard
look at OCaml). Of course, you
could argue that even the OCaml solution is really CPU bound,
because bringing its runtime down to roughly the time the I/O alone
requires saturated 8 cores and 32 hardware threads. Scanning down
the list
would seem to indicate that in most cases I/O does not represent a
significant bottleneck, but then it’s hard to really tell. The disk
may not be the bottleneck, but the I/O routines
within the libraries or runtime may be. Consequently, from the
application programmer perspective, WF2 is I/O bound.

Redefining “I/O bound” and future impacts

For a long time “I/O bound”
primarily referred to hardware or possibly operating system
limitations. That was a useful definition, but it is time for it to
change. Most software being developed today has a very tall stack
of abstractions sitting between it and the hardware. Operating
systems schedule I/O and have to split limited resources among many
competing processes. Virtual machines sit on top of operating
systems, isolating the programmer from the underlying OS and
hardware. Libraries and frameworks isolate the programmer from the
virtual machine and even from other libraries and frameworks. I/O,
from the programmer’s perspective, is – or at least should be, if he
is working on top of good abstractions – “everything that happens
between when my program requests some data and when it receives
it.” Consequently, libraries and runtimes should go to great lengths
to keep “I/O bound” as close to its original meaning as possible.
Prior to multicore systems becoming pervasive that was largely true,
but today’s I/O libraries fail to take advantage of multiple cores,
and consequently force I/O into being a bottleneck when it should
not be.

That’s why in my WideFinder
submissions I’ve worked to separate the parallelized I/O into a
library-like module. It is quite clearly not done, but it is
relatively successful. I reused the parallel I/O code that I
developed for WF1 on WF2 without changing any of the logic. It
doesn’t provide many features, and it could use a fair amount of
optimization and simplification (the line joining logic is an
atrocity), but it works.
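
For a sense of what that line joining logic has to deal with, here’s
a heavily simplified, single-threaded Java sketch (not the actual
module, which hands the joined regions off to parallel workers):
blocks rarely end on line boundaries, so each block’s trailing
fragment has to be carried over and prepended to the next block
before parsing.

    import java.io.FileInputStream;
    import java.io.IOException;

    public class BlockLineJoin {
        public static void main(String[] args) throws IOException {
            FileInputStream in = new FileInputStream("access.log");
            byte[] block = new byte[8 * 1024 * 1024];
            byte[] carry = new byte[0]; // unfinished line from the last block

            int n;
            while ((n = in.read(block)) > 0) {
                // Find the last newline; everything after it is an
                // incomplete line that belongs with the next block.
                int cut = n - 1;
                while (cut >= 0 && block[cut] != '\n') cut--;

                if (cut < 0) {
                    // No newline at all: the whole block extends the carry.
                    carry = concat(carry, block, n);
                    continue;
                }

                // carry + block[0..cut] is a run of complete lines; in
                // the real module this region goes to a worker.
                processLines(concat(carry, block, cut + 1));

                // Save the tail fragment for the next iteration.
                carry = new byte[n - cut - 1];
                System.arraycopy(block, cut + 1, carry, 0, carry.length);
            }
            if (carry.length > 0) processLines(carry); // final unterminated line
            in.close();
        }

        static byte[] concat(byte[] a, byte[] b, int bLen) {
            byte[] r = new byte[a.length + bLen];
            System.arraycopy(a, 0, r, 0, a.length);
            System.arraycopy(b, 0, r, a.length, bLen);
            return r;
        }

        static void processLines(byte[] lines) { /* parse complete lines */ }
    }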
