Regular Expressions
yawk's search function looks for regular expressions. If you don't have your regexp documentation at hand, here's an excerpt from gawk's manpage. Formatting mistakes are mine.
-
Regular Expressions
Regular expressions are the extended kind found in egrep.
They are composed of characters as follows:
c matches the non-metacharacter c.
\c matches the literal character c.
. matches any character including newline.
^ matches the beginning of a string.
$ matches the end of a string.
[abc...] character list, matches any of the characters
abc....
[^abc...] negated character list, matches any character
except abc....
r1|r2 alternation: matches either r1 or r2.
r1r2 concatenation: matches r1, and then r2.
r+ matches one or more r's.
r* matches zero or more r's.
r? matches zero or one r's.
(r) grouping: matches r.
\y matches the empty string at either the beginning
or the end of a word.
\B matches the empty string within a word.
\< matches the empty string at the beginning of a
word.
\> matches the empty string at the end of a word.
\w matches any word-constituent character (letter,
digit, or underscore).
\W matches any character that is not
word-constituent.
\` matches the empty string at the beginning of a
buffer (string).
\' matches the empty string at the end of a
buffer.
The escape sequences that are valid in string constants
(see below) are also valid in regular expressions.
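As a rough illustration of the basic operators listed above, here is a minimal sketch that uses Python's re module as a stand-in for gawk. That choice is an assumption made for the example, and the sample strings are made up. For these operators the two notations agree, with one labeled exception: Python's . does not match a newline unless re.DOTALL is given.

```python
import re

# Alternation inside a group: gr(a|e)y matches "gray" or "grey".
alternation = re.search(r"gr(a|e)y", "a grey cat").group()

# Repetition: + is one or more, * is zero or more, ? is zero or one.
one_or_more = re.findall(r"ab+", "a ab abb")
zero_or_more = re.findall(r"ab*", "a ab abb")
zero_or_one = re.findall(r"ab?", "a ab abb")

# Anchors: ^ and $ tie the match to the start and end of the string.
anchored_hit = re.search(r"^cat$", "cat") is not None
anchored_miss = re.search(r"^cat$", "concat") is not None

# Character lists: [abc] matches a listed character, [^abc] any other.
listed = re.findall(r"[abc]", "cabbage")
negated = re.findall(r"[^abc]", "cabbage")

# \c makes a metacharacter literal: \. is a dot, not "any character".
literal_dot = re.findall(r"a\.b", "a.b axb")
any_char = re.findall(r"a.b", "a.b axb")

# Difference from gawk: Python's . needs re.DOTALL to match a newline.
dot_newline_default = re.search(r"a.b", "a\nb") is not None
dot_newline_dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None

print(alternation)                    # grey
print(one_or_more, zero_or_more, zero_or_one)
print(anchored_hit, anchored_miss)    # True False
print(listed, negated)
print(literal_dot, any_char)
print(dot_newline_default, dot_newline_dotall)  # False True
```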
Character classes are a new feature introduced in the
POSIX standard. A character class is a special notation
for describing lists of characters that have a specific
attribute, but where the actual characters themselves can
vary from country to country and/or from character set to
character set. For example, the notion of what is an
alphabetic character differs in the USA and in France.
A character class is only valid in a regular expression
inside the brackets of a character list. Character
classes consist of [:, a keyword denoting the class, and
:]. The character classes defined by the POSIX standard
are:
[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space or tab characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are both printable and visible.
(A space is printable, but not visible, while
an a is both.)
[:lower:] Lower-case alphabetic characters.
[:print:] Printable characters (characters that are not
control characters).
[:punct:] Punctuation characters (characters that are not
letters, digits, control characters, or space
characters).
[:space:] Space characters (such as space, tab, and
formfeed).
[:upper:] Upper-case alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.
For example, before the POSIX standard, to match
alphanumeric characters, you would have had to write
/[A-Za-z0-9]/. If your character set had other alphabetic
characters in it, this would not match them, and if your
character set collated differently from ASCII, this might
not even match the ASCII alphanumeric characters. With
the POSIX character classes, you can write /[[:alnum:]]/,
and this matches the alphabetic and numeric characters in
your character set.
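The difference between a fixed ASCII range and a class that follows the character set can be sketched as follows. Python's re module is used as a stand-in and has no [[:alnum:]] notation, so the pattern [^\W_] ("word characters except underscore") serves as an approximate substitute. Both that substitution and the sample text are assumptions for this example. Python's matching here is Unicode-based rather than locale-based, so it only approximates what a POSIX class does.

```python
import re

text = "café42 naïve_x"

# ASCII-only range: accented letters fall outside it, so words are split.
ascii_runs = re.findall(r"[A-Za-z0-9]+", text)

# Approximate stand-in for [[:alnum:]]+ : word characters minus underscore.
# For str patterns in Python 3, \w is Unicode-aware by default.
alnum_runs = re.findall(r"[^\W_]+", text)

print(ascii_runs)   # ['caf', '42', 'na', 've', 'x']
print(alnum_runs)   # ['café42', 'naïve', 'x']
```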
Two additional special sequences can appear in character
lists. These apply to non-ASCII character sets, which can
have single symbols (called collating elements) that are
represented with more than one character, as well as
several characters that are equivalent for collating, or
sorting, purposes. (E.g., in French, a plain "e" and a
grave-accented "è" are equivalent.)
Collating Symbols
A collating symbol is a multi-character collating
element enclosed in [. and .]. For example, if ch
is a collating element, then [[.ch.]] is a regular
expression that matches this collating element,
while [ch] is a regular expression that matches
either c or h.
Equivalence Classes
An equivalence class is a locale-specific name for
a list of characters that are equivalent. The name
is enclosed in [= and =]. For example, the name e
might be used to represent all of "e," "é," and
"è." In this case, [[=e=]] is a regular expression
that matches any of e, é, or è.
These features are very valuable in non-English speaking
locales. The library functions that gawk uses for regular
expression matching currently only recognize POSIX
character classes; they do not recognize collating symbols or
equivalence classes.
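Since the [. .] and [= =] notations are not recognized, their intent can only be approximated by hand. The sketch below does this in Python, which also lacks both notations. The helpers base_letter and in_equivalence_class are hypothetical names invented for this example, and Unicode decomposition is used as a rough substitute for a locale's equivalence classes, which is not the same mechanism POSIX defines.

```python
import re
import unicodedata

# [ch] is a character list: it matches a single c or a single h.
single_chars = re.findall(r"[ch]", "chico")

# Stand-in for [[.ch.]]: treat the pair "ch" as one unit.
digraphs = re.findall(r"(?:ch)", "chico")


def base_letter(ch):
    """Return ch with combining accents removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def in_equivalence_class(ch, name):
    """Rough stand-in for [[=name=]]: same base letter, accents ignored."""
    return base_letter(ch) == name


equivalent_to_e = [c for c in "eéèf" if in_equivalence_class(c, "e")]

print(single_chars)      # ['c', 'h', 'c']
print(digraphs)          # ['ch']
print(equivalent_to_e)   # ['e', 'é', 'è']
```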
The \y, \B, \<, \>, \w, \W, \`, and \' operators are
specific to gawk; they are extensions based on facilities
in the GNU regular expression libraries.
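Because these operators are gawk extensions, other regexp engines spell them differently or lack them. As a hedged illustration, the sketch below shows how the same ideas might be written with Python's re module. The mapping is my own reading, not something the manpage states: \y corresponds to Python's \b, \` and \' to \A and \Z, and \< and \> have no direct counterpart, so they are emulated with lookarounds. The sample text is made up.

```python
import re

text = "cat concat cats"

# gawk \y (either edge of a word) corresponds to Python's \b.
whole_word = re.findall(r"\bcat\b", text)

# gawk \< (start of a word): a boundary followed by a word character.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)cat", text)]

# gawk \> (end of a word): a boundary preceded by a word character.
word_ends = [m.start() for m in re.finditer(r"cat(?<=\w)\b", text)]

# \B (empty string within a word) is spelled the same in both.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# \w and \W (word-constituent or not) are also spelled the same.
non_word_runs = re.findall(r"\W+", text)

# gawk \` and \' (buffer start and end) correspond to \A and \Z.
at_buffer_start = re.search(r"\Acat", text) is not None
at_buffer_end = re.search(r"cat\Z", text) is not None

print(whole_word)     # ['cat']
print(word_starts)    # [0, 11]
print(word_ends)      # [0, 7]
print(inside_word)    # [7]
print(non_word_runs)  # [' ', ' ']
print(at_buffer_start, at_buffer_end)  # True False
```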
-