Relevance Search
yawk uses a relevance search for document search. Usually you don't have to know about search expressions: just enter the words (or word fragments) you want to find and let yawk search. That's all.
yawk's relevance search is basically a non-indexed full-text search on all wiki files. Assuming the search expression "relevance + search", searching works as follows:
- Read the (unformatted) wikifile.
- Count the occurrences of each of the search words "relevance" and "search",
- apply the arithmetic expression to them (in this example: add the numbers),
- divide the final result by the file's size, and
- store that value as file relevance score.
- Repeat the above for each file.
- Sort all documents with a non-zero relevance and compute each file's relevance as a percentage. The file with the maximum relevance defines 100%.
- Output the resulting list.
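The steps above can be sketched in Python. This is an illustrative sketch only, not yawk's actual code: the names `count_matches` and `rank`, the in-memory `files` mapping, and the use of `len()` as the file size are assumptions, and the expression is hard-wired to "relevance + search".

```python
import re

def count_matches(word, text):
    """Count case-insensitive occurrences of a word fragment in text."""
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))

def rank(files):
    """Score each file for the expression 'relevance + search'.

    files maps a (hypothetical) file name to its unformatted text.
    Returns (name, percent) pairs, best match first.
    """
    scores = {}
    for name, text in files.items():
        if not text:
            continue  # assumption: an empty file has no size to divide by
        value = count_matches("relevance", text) + count_matches("search", text)
        score = value / len(text)  # divide by the file's size
        if score > 0:
            scores[name] = score  # keep only non-zero relevance
    if not scores:
        return []
    best = max(scores.values())  # this file defines 100%
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, 100.0 * score / best) for name, score in ranked]
```

Because the score is divided by the file size, a short file with a few matches can outrank a long file with many.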
Search expressions
Search expressions are made of operands (search words) and operators. The following operators (shown in order of increasing precedence) are defined:
| || | logical or |
| && | logical and |
| != > < >= <= >> << >>= <<= | comparison operators |
| + - | plus, minus |
| * / , ,, | multiplication, division and "product+sum" |
| + - ! | unary plus, minus and not |
| () numbers words strings | expressions can be grouped with parentheses; numbers, words and strings are literals. |
Due to the nature of relevance search not all operators behave as usual.
- The logical operators (or, and) and the standard comparators return 0 if the expression evaluates to false and 1 otherwise. In contrast, the operators >>, <<, >>= and <<= return the operator's right-hand value if the expression is true.
- If a subtraction would yield a negative value, the result is set to 0.
- If the left-hand operand in a division is 0 or negative, the operation's result is set to zero.
- The comma operator has the following definition: "x , y := (x * y) + (x + y)". This is useful if you want to search for documents that contain both of two search words but also want to see those documents that contain at least one of the words.
- The double-comma operator returns the same as the normal comma operator, but only if both operands are greater than zero.
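The non-standard operators can be written down as small Python functions. This is a sketch under assumptions: the function names are hypothetical, and returning 0 when a `>>`-style comparison is false or when a double-comma operand is zero is inferred, since the text only describes the true case.

```python
def minus(x, y):
    """'-': subtraction that never goes below zero."""
    return max(x - y, 0)

def greater_keep(x, y):
    """'>>': like '>', but yields the right-hand value when true.

    Assumption: yields 0 when the comparison is false.
    """
    return y if x > y else 0

def greater_equal_keep(x, y):
    """'>>=': like '>=', but yields the right-hand value when true."""
    return y if x >= y else 0

def comma(x, y):
    """',': product plus sum, as defined above."""
    return (x * y) + (x + y)

def double_comma(x, y):
    """',,': same as ',' but only when both operands are positive.

    Assumption: yields 0 otherwise.
    """
    return comma(x, y) if x > 0 and y > 0 else 0
```

With match counts of 3 and 0, `comma` still yields 3, so the document is listed, while `double_comma` yields 0. With counts of 3 and 2, both yield 11.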
Literals
The following literals are recognized:
- Words are sequences that start with a letter, followed by more letters and/or digits. They are replaced by their match count for each document.
- Numbers are taken "as is".
- Quoted strings (single or double quoted) are handled like word literals but may contain non-word characters such as operators.
Notice also that
- The relevance search doesn't really search for words but for word fragments, e.g. the search literal awk is found in words like awk, gawk and yawk.
- String literals are regular expressions (word literals too, but these cannot contain special characters by definition). The search term '[^gy]awk' (notice that the single quotes belong to the search term) matches awk but not gawk or yawk.
- Relevance search is case insensitive and searches the unformatted wiki files.
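The fragment and regular-expression matching described above can be tried out with a short Python sketch. The helper `match_count` and the sample sentence are hypothetical, and Python's `re` dialect is assumed to be close enough to the one yawk uses for these simple patterns.

```python
import re

def match_count(pattern, text):
    """Count non-overlapping, case-insensitive regex matches in text."""
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

text = "Awk, gawk and yawk are all awk dialects."

# The bare fragment also matches inside gawk and yawk.
print(match_count("awk", text))       # prints 4

# Requires one character before "awk" that is neither g nor y.
print(match_count("[^gy]awk", text))  # prints 1
```

Note that `[^gy]` must consume a character, so the "Awk" at the very start of the sample text is not counted by the second pattern.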
File related literals
The only file-related literal is %size, which can be used like any other literal. %size returns the file's size in bytes.
Notice that when a file literal appears as the first word in the search expression, the expression must contain at least one blank to prevent it from being interpreted as yawk's file search.
Default operator
Whenever two consecutive literals appear in the search expression, a comma operator is inserted between them.
Sample expressions
The following table gives some sample expressions:
| Search term | Description |
| relevance * search | lists all documents that contain the words "relevance" and "search". |
| relevance + search | lists the documents that contain either one or both of the words. |
| relevance , search | same as above but documents containing both words usually get a higher ranking. |
| relevance search | exactly as above. |
| wiki >= 5 | lists all documents that contain "wiki" at least five times. |
| wiki >>= 5 | same as above but with document ranking. |
| %size >> 1000 | lists all files larger than 1000 bytes. |
| %size > 0 | lists all files. |
Since relevance search uses the comma operator as the default, you can usually just enter the words you're looking for. The comma operator tries to mimic the behavior of common search engines: list all documents containing at least one of the search words but give a higher ranking to those containing all of them.
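To see why the first three sample expressions list different documents, the following sketch computes their raw values for a few made-up match counts. The function name `raw_values`, the file names and the counts are hypothetical, and the values are shown before the division by file size.

```python
def raw_values(r, s):
    """Return the raw values of 'r * s', 'r + s' and 'r , s'."""
    product = r * s            # relevance * search
    total = r + s              # relevance + search
    comma = (r * s) + (r + s)  # relevance , search
    return product, total, comma

# Hypothetical match counts (relevance, search) per document.
counts = {"both.txt": (2, 3), "one.txt": (4, 0), "none.txt": (0, 0)}

for name, (r, s) in counts.items():
    print(name, raw_values(r, s))
```

For "both.txt" the three values are 6, 5 and 11; for "one.txt" they are 0, 4 and 4. The product drops the document with only one of the words, while the comma operator keeps it and still favors the document containing both.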