All text substitutions that can be invoked through the filter action must first be defined in the filter file, which is typically called default.filter and which can be selected through the filterfile config option.
Just like the actions files, the filter file is organized in sections, which are called filters here. Each filter consists of a heading line that starts with the keyword FILTER:, followed by the filter's name and a short (one-line) description of what it does. Below that line come the jobs, i.e. lines that define the actual text substitutions. By convention, the name of a filter should describe what the filter eliminates. The description is used in the web-based user interface.
A filter header line for a filter called "foo" could look like this:
    FILTER: foo Replace all "foo" with "bar"
Below that line, and up to the next header line, come the jobs that define what text replacements the filter executes. They are specified in a syntax that imitates Perl's s/// operator. If you are familiar with Perl, you will find this to be quite intuitive, and may want to look at the PCRS man page for the subtle differences to Perl behaviour. Most notably, the non-standard option letter U is supported, which turns the default to ungreedy matching.
If you are new to regular expressions, you might want to take a look at the Appendix on regular expressions, and see the Perl manual for the s/// operator's syntax and Perl-style regular expressions in general. The examples below might also help to get you started.
Now, let's complete our "foo" filter. We have already defined the heading, but the jobs are still missing. Since all it does is replace "foo" with "bar", only one (trivial) job is needed:
    s/foo/bar/
But wait! Didn't the description say that all occurrences of "foo" should be replaced? Our current job will only take care of the first "foo" on each page. For global substitution, we'll need to add the g option:
    s/foo/bar/g
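The difference between the job with and without g can be tried outside Privoxy. The following is a minimal Python sketch, not filter-file syntax: Python's re.sub stands in for PCRS, its count argument mimics the absence of the g option, and the sample string page is made up for illustration.

```python
import re

# Hypothetical page content, used only for this illustration.
page = "foo one, foo two, foo three"

# Rough analogue of s/foo/bar/ : only the first match is replaced.
first_only = re.sub(r"foo", "bar", page, count=1)

# Rough analogue of s/foo/bar/g : every match is replaced.
global_sub = re.sub(r"foo", "bar", page)

print(first_only)  # bar one, foo two, foo three
print(global_sub)  # bar one, bar two, bar three
```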
Our complete filter now looks like this:
    FILTER: foo Replace all "foo" with "bar"
    s/foo/bar/g
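To make the layout of a filter file concrete, here is a small Python sketch that groups job lines under the header line that precedes them. It is only an illustration of the structure described so far, not Privoxy's own parser: the function name parse_filter_file is hypothetical, and skipping blank lines and lines that start with # is an assumption.

```python
def parse_filter_file(text):
    """Group job lines under the FILTER: header that precedes them."""
    filters = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and # comments carry no jobs
        if line.startswith("FILTER:"):
            # Header line: keyword, filter name, one-line description.
            rest = line[len("FILTER:"):].strip()
            name, _, description = rest.partition(" ")
            current = {"description": description.strip(), "jobs": []}
            filters[name] = current
        elif current is not None:
            # Any other line belongs to the most recent filter as a job.
            current["jobs"].append(line)
    return filters


sample = '''
FILTER: foo Replace all "foo" with "bar"
s/foo/bar/g
'''

print(parse_filter_file(sample))
```

Every line up to the next header line ends up in the jobs list of the current filter, which mirrors the "below that line, and up to the next header line" rule.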
Following the header line and a comment, you see the job. Note that it uses | as the delimiter instead of /, because the pattern contains a forward slash, which would otherwise have to be escaped by a backslash (\).
Now, let's examine the pattern: it starts with the text <script.* enclosed in parentheses. Since the dot matches any character, and * means: "Match an arbitrary number of the element left of myself", this matches "<script", followed by any text, i.e. it matches the whole page, from the start of the first <script> tag.
That's more than we want, but the pattern continues: document\.referrer matches only the exact string "document.referrer". The dot needed to be escaped, i.e. preceded by a backslash, to take away its special meaning as a wildcard and make it just a regular dot. So far, the meaning is: Match from the start of the first <script> tag in the page, up to, and including, the text "document.referrer", if both are present in the page (and appear in that order).
But there's still more pattern to go. The next element, again enclosed in parentheses, is .*</script>. You already know what .* means, so the whole pattern translates to: Match from the start of the first <script> tag in a page to the end of the last </script> tag, provided that the text "document.referrer" appears somewhere in between.
This is still not the whole story, since we have ignored the options and the parentheses: The portions of the page matched by sub-patterns that are enclosed in parentheses will be remembered and be available through the variables $1, $2, ... in the substitute. The U option switches to ungreedy matching, which means that the first .* in the pattern will only "eat up" the text between "<script" and the first occurrence of "document.referrer", and that the second .* will only span the text up to the first "</script>" tag. Furthermore, the s option says that the match may span multiple lines in the page, and the g option again means that the substitution is global.
So, to summarize, the pattern means: Match all scripts that contain the text "document.referrer". Remember the parts of the script from (and including) the start tag up to (and excluding) the string "document.referrer" as $1, and the part following that string, up to and including the closing tag, as $2.
Now the pattern is deciphered, but wasn't this about substituting things? So let's look at the substitute: $1"Not Your Business!"$2 is easy to read: The text remembered as $1, followed by "Not Your Business!" (including the quotation marks!), followed by the text remembered as $2. This produces an exact copy of the original string, with the middle part (the "document.referrer") replaced by "Not Your Business!".
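Put together from the description above, the job would read s|(<script.*)document\.referrer(.*</script>)|$1"Not Your Business!"$2|Usg (the order of the option letters is an assumption). Its behaviour can be approximated with a short Python sketch. Python's re module has no U option, so each .* is written as the lazy .*? instead; re.S plays the role of the s option, and re.sub is global by default. The function name hide_referrer and the sample page are made up for illustration.

```python
import re

# Lazy quantifiers (.*?) stand in for the U option; re.S lets the dot
# match newlines, like the s option.
referrer_job = re.compile(
    r"(<script.*?)document\.referrer(.*?</script>)", re.S
)


def hide_referrer(page):
    # \g<1> and \g<2> correspond to $1 and $2 in the substitute.
    return referrer_job.sub(r'\g<1>"Not Your Business!"\g<2>', page)


# Hypothetical page: one script reads the referrer, one does not.
page = (
    "<script>\nvar r = document.referrer;\n</script>"
    "<p>text</p>"
    "<script>var x = 1;</script>"
)

print(hide_referrer(page))
```

Only the first script is rewritten. The second one contains no "document.referrer", so the pattern finds nothing to match there and the text is left untouched.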
\s stands for whitespace characters (space, tab, newline, carriage return, form feed), so that \s* means: "zero or more whitespace". The ? in .*? makes this matching of arbitrary text ungreedy. (Note that the U option is not set). The ['"] construct means: "a single or a double quote". Finally, \1 is a backreference to the first parenthesis just like $1 above, with the difference that in the pattern, a backslash indicates a backreference, whereas in the substitute, it's the dollar.
So what does this job do? It replaces assignments of single- or double-quoted strings to the "window.status" property with a dummy assignment (using a variable name that is hopefully odd enough not to conflict with real variables in scripts). Thus, it catches many cases where e.g. pointless descriptions are displayed in the status bar instead of the link target when you move your mouse over links.
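Based on the elements explained above, the pattern would look like window\.status\s*=\s*(['"]).*?\1, applied case-insensitively and globally. The Python sketch below tries it on a made-up link. The replacement text dUmMy=1, the function name neutralize_status, and the sample HTML are assumptions chosen for illustration.

```python
import re

# (['"]) remembers the opening quote; \1 requires the same quote to close.
status_job = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.I)


def neutralize_status(page):
    # Assumed dummy assignment; any unlikely variable name would do.
    return status_job.sub("dUmMy=1", page)


# Hypothetical link that hides its target behind a status bar message.
html = (
    '<a href="/x" onmouseover="window.status = \'Click here!\'; '
    'return true">x</a>'
)

print(neutralize_status(html))
```

Because the closing quote must be the same character as the opening one, a string such as "it's" is matched up to its closing double quote rather than cut off at the apostrophe.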
Including the OnUnload event binding in the HTML DOM was a CRIME. When I close a browser window, I want it to close and die. Basta. This job replaces the "onunload" attribute in "<body>" tags with the dummy word never. Note that the i option makes the pattern matching case-insensitive. Also note that ungreedy matching alone doesn't always guarantee a minimal match: In the first parenthesis, we had to use [^>]* instead of .* to prevent the match from exceeding the <body> tag if it doesn't contain "OnUnload", but the page's content does.
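The point about [^>]* can be checked with a small Python comparison. The pattern (<body [^>]*)onunload(.*>) with the options i and U is an assumption pieced together from the description above; in Python the U option is again emulated with lazy quantifiers. The names careful_job, naive_job and disable_onunload, as well as the sample pages, are made up for illustration.

```python
import re

# [^>]*? cannot run past the end of the <body ...> tag.
careful_job = re.compile(r"(<body [^>]*?)onunload(.*?>)", re.I)

# .*? is ungreedy too, but it may still cross the closing ">".
naive_job = re.compile(r"(<body .*?)onunload(.*?>)", re.I)


def disable_onunload(page, job=careful_job):
    return job.sub(r"\g<1>never\g<2>", page)


# Hypothetical pages: the attribute in the tag, and the word in the text.
with_handler = '<body bgcolor="white" OnUnload="popup()">'
only_in_text = '<body class="a"><p>OnUnload is evil</p>'

print(disable_onunload(with_handler))
print(disable_onunload(only_in_text))
print(disable_onunload(only_in_text, naive_job))
```

The last call shows the failure mode: with .* in the first parenthesis, the match leaves the <body> tag and rewrites the word in the page's text.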
The last example is from the fun department:
Note the (?!\.com) part (a so-called negative lookahead) in the job's pattern, which means: Don't match if the string ".com" directly follows "microsoft" in the page. This prevents links to microsoft.com from being trashed, while still replacing the word everywhere else.
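The effect of the negative lookahead is easy to verify in Python, whose re module supports the same (?!...) construct. In this sketch the replacement word "Macrohard", the function name rename, and the sample text are placeholders chosen for illustration; case-insensitive, global matching is assumed.

```python
import re

# (?!\.com) rejects a match when ".com" follows directly.
word_job = re.compile(r"microsoft(?!\.com)", re.I)


def rename(page):
    # "Macrohard" is a placeholder replacement for this illustration.
    return word_job.sub("Macrohard", page)


text = 'Visit <a href="http://www.microsoft.com/">Microsoft</a> today.'

print(rename(text))
```

The link target stays intact because the lookahead only inspects the following characters and never consumes them.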
The x option in this job turns on extended syntax, and allows for e.g. the liberal use of (non-interpreted!) whitespace for nicer formatting.
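Python's re.VERBOSE flag behaves much like the x option and can serve as an illustration. The sketch below does not reproduce the job discussed above; it reuses the window.status pattern from earlier as an assumed example and checks that the spread-out form matches the same text as the compact one.

```python
import re

compact = re.compile(r"""window\.status\s*=\s*(['"]).*?\1""", re.I)

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment, so the pattern can be laid out piece by piece.
verbose = re.compile(
    r"""
    window\.status   # the property being assigned
    \s* = \s*        # equals sign with optional whitespace around it
    (['"])           # opening quote, remembered as group 1
    .*?              # the string content, ungreedy
    \1               # the same quote character closes the string
    """,
    re.I | re.VERBOSE,
)

# Hypothetical script fragment used to compare both forms.
sample = "window.status = 'Hello'; x = 1;"

print(compact.sub("dUmMy=1", sample))
print(verbose.sub("dUmMy=1", sample))
```

Whitespace that is meant to match literally has to be written explicitly in this mode, for example as \s or as an escaped space.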
You get the idea?