Binary Data

gawk is a scripting language for text processing, so it works best on text input. Read "text" as "human readable text": numbers, letters, punctuation etc. But gawk can also work with binary data, including strings that contain the "nasty" ASCII NUL character (code 0), which C programs use to terminate character strings. Reading binary data is just a little more difficult.

The following simple script works on any input data, including binary data such as executable programs, and counts characters, words and lines, similar to wc.

#!/usr/bin/gawk -f
#

#
# wc.awk - count lines, words and characters.
#

BEGIN {
	while (getline > 0) {
		lines++;
		chars = chars + length($0) + length(RT);
		words = words + split($0, x, /[[:space:]]+/);
		}

	printf ("%d %d %d\n", lines, words, chars);
	exit (0);
	}

The character count (the chars variable) takes into account that the newline character is significant and must be counted. Hence the script adds the length of the "magic variable" RT to the total number of characters (or bytes). Notice that gawk still reads newline-terminated lines in this example, but it doesn't assume or require that a line's characters are "human readable text".
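
To see what RT contributes, here is a minimal sketch (rt.awk is a made-up name, not part of the original scripts) that prints the length of each record and of its terminator.

```awk
#!/usr/bin/gawk -f
#
# rt.awk - show record length and terminator length for each record.
#

BEGIN {
	while ((getline) > 0)
		printf ("record %d: %d + %d\n", ++n, length($0), length(RT));

	exit (0);
	}
```

For input whose last line has no trailing newline, RT is empty for that record:

# printf 'ab\ncd' | ./rt.awk
record 1: 2 + 1
record 2: 2 + 0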

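The way the script counts words differs from wc's understanding of words, so the results will not match; for example, split() with a regular expression separator also counts the empty fields produced by leading or trailing whitespace. Additionally, the script always counts the last line of the file as one line, regardless of whether it is terminated with a newline, whereas wc counts newline characters.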
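
One likely source of the word count difference can be shown in isolation. The sketch below (the sample string is made up) splits the same line once with the regular expression used in wc.awk and once with the single-space separator, which makes gawk ignore leading and trailing whitespace. Note that the single-space separator splits on blanks, tabs and newlines only, so it is not a drop-in replacement for [[:space:]].

```awk
BEGIN {
	line = "  two words ";

	n1 = split(line, x, /[[:space:]]+/);	# "", "two", "words", ""
	n2 = split(line, y, " ");		# "two", "words"

	printf ("%d %d\n", n1, n2);		# prints "4 2"
	exit (0);
	}
```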

Having said this, I consider the following two results equal within the given limitations.

# wc /usr/bin/gawk
   1756   12758  634303 /usr/bin/gawk
# ./wc.awk /usr/bin/gawk
1757 12863 634303

Reading a given number of bytes

Binary data may include any character. There is no control character (such as a newline) that marks the end of a data record (an input line). Using a length indicator that tells how much binary data follows is a common workaround.
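
As a rough illustration of a length indicator, the sketch below walks through a string of records in a hypothetical "<length>:<data>" layout, where the length is written as a decimal number. The layout and the sample data are assumptions made for this example only; real binary formats usually store the length in a fixed number of bytes.

```awk
BEGIN {
	buf = "5:hello3:abc";		# sample data, two records

	while ((p = index(buf, ":")) > 1) {
		n = substr(buf, 1, p - 1) + 0;		# length indicator
		printf ("%d bytes: %s\n", n, substr(buf, p + 1, n));
		buf = substr(buf, p + 1 + n);		# skip to the next record
		}

	exit (0);
	}
```

This prints "5 bytes: hello" and "3 bytes: abc". The data part may contain any character, including ":" and newlines, because only the length decides where a record ends.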

gawk's manpage says that "gawk sets RT to the input text that matched the character or regular expression specified by RS". With this in mind it's not too difficult to read binary data.

#!/usr/bin/gawk -f
#

#
# dd.awk <size> <count> - print <count>*<size> characters from stdin.
#


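function readbytes(n,   m, s, line, str) {
	# Save RS and switch to a separator that should be rare in the input.
	__RS = RS;
#	RS = "\xF1\xF2\x00";
	RS = "\xF1";

	m = n;				# bytes still to deliver
	while (m > 0) {
		s = length(__readbuffer);
		if (s == 0) {
			# Buffer is empty: read the next record and put
			# back the separator that getline stripped.
			getline __readbuffer;
			if (RT != "")
				__readbuffer = __readbuffer RS;

			if ((s = length(__readbuffer)) == 0)
				break;		# end of input

			}

		if (s > 0) {
			if (m > s) {
				# Need more than is buffered: take it all.
				str = str __readbuffer;
				m = m - s;
				__readbuffer = "";
				}
			else {
				# Take the first m bytes, keep the rest
				# for the next call, and stop.
				str = str substr(__readbuffer, 1, m);
				__readbuffer = substr(__readbuffer, m+1);
				m = 0;
				}
			}
		}

	RS = __RS;
	return (str);
	}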

BEGIN {
	size = ARGV[1];
	count = ARGV[2];
	ARGV[1] = ARGV[2] = "";

	bytes = size * count;
	data = readbytes(bytes);

	printf ("%s", data);
	exit (0);
	}

Calling the program with two numeric parameters, size and count, reads size*count bytes from the input (if the input contains that many characters) and prints them to stdout. This is (very) basically what dd does.

The interesting thing here is that the function readbytes() receives the number of bytes to read and returns exactly that many if possible. You should also notice that you cannot mix readbytes() with calls to getline, since readbytes() maintains its own input buffer.
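
A function of this kind can also be called repeatedly to read fixed-size records. The following sketch (chunks.awk is a made-up name) uses a condensed variant of readbytes(), not the original code: it keeps the default RS and appends RT instead of RS. It prints the input in chunks of four bytes; the last chunk may be shorter.

```awk
#!/usr/bin/gawk -f
#
# chunks.awk - print stdin in 4-byte chunks, one chunk per output line.
#

function readbytes(n,   s, str) {
	while (n > 0) {
		if (length(__readbuffer) == 0) {
			if ((getline __readbuffer) <= 0)
				break;		# end of input or read error

			__readbuffer = __readbuffer RT;	# put the terminator back
			}

		s = length(__readbuffer);
		if (s > n)
			s = n;

		str = str substr(__readbuffer, 1, s);
		__readbuffer = substr(__readbuffer, s + 1);
		n = n - s;
		}

	return (str);
	}

BEGIN {
	while ((chunk = readbytes(4)) != "")
		printf ("[%s]\n", chunk);

	exit (0);
	}
```

# printf 'abcdefghij' | ./chunks.awk
[abcd]
[efgh]
[ij]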

Some tests with the dd.awk script showed that its execution time depends on the number of lines (or better: records) it finds in its input; the script is faster if there are fewer records. For this reason it sets the RS variable to something I don't expect to appear very often in the input, giving only a few input records. A multicharacter sequence would clearly be better for this, but that brings us back to the contents of the RT variable: the manpage's statement is not true, at least for gawk version 3.1.3. Working with a multicharacter RS returns one byte less than is available in the input; with version 3.1.4 it's ok.

< dag | at | awk-scripting.de >