.\" Man Page for BFILTER
.\" groff -man -Tascii bfilter.8
.TH BFILTER 8 "August 2005"
.SH NAME
bfilter \- An ad-filtering web proxy using heuristic ad-detection algorithms
.SH SYNOPSIS
.B bfilter
[-c DIRECTORY] [-r DIRECTORY] [-u USER] [-g GROUP] [-n] [-h] [-v]
.SH "DESCRIPTION"
.PP
.B bfilter
is a web proxy that uses effective heuristic ad-detection algorithms to remove banner adverts, popups and webbugs from web pages.
The traditional blocklist-based approach is also implemented, but it is mostly used for dealing with false positives.
Unlike other tools that require constant updates of their blocklists, bfilter manages to remove over 90% of adverts even with an empty blocklist!
.P
All processing is done on the fly: bfilter doesn't load the whole page or image before processing it.
It uses heuristic and regex-based approaches to detect adverts and webbugs.
It also uses a Javascript engine to combat Javascript-generated adverts and popups.
.P
The web proxy supports the following features:
.PP
.B o
HTTP/0.9 - HTTP/1.1 support
.br
.B o
Persistent connections (HTTP/1.1 only)
.br
.B o
Pipelining (HTTP/1.1 only)
.br
.B o
HTTP compression
.br
.B o
Forwarding to another proxy
.P
However, it does
.B not
support CONNECT requests typically used for HTTPS.
.SH OPTIONS
.TP
.B -c, --confdir DIRECTORY
Set custom config directory
.TP
.B -r, --chroot DIRECTORY
Set chroot directory
.TP
.B -u, --user USER
Set unprivileged user
.TP
.B -g, --group GROUP
Set unprivileged group
.TP
.B -n, --nodaemon
Disable background daemon mode
.TP
.B -h, --help
Show help
.TP
.B -v, --version
Print version
.SH RESOURCES
.HP
.B /etc/bfilter/config
.br
.I listen_address = host:port
.br
The address to bind the proxy to. If unspecified, bind to all interfaces.
.br
.I client_compression = yes | no
.br
If set to yes, all the textual data with "Content-Type: text/*" will be compressed before sending it to the client.
This option can be useful if you are on a slow connection and bfilter runs somewhere on a fast connection.
In other cases, setting this option to yes will just introduce additional latency to the loading process.
.br
.I ad_border = rrggbb | none
.br
The default behavior is to draw borders around removed adverts. You may want to change the border color or turn the borders off.
.br
.I no_flash = yes | no
.br
This option is for people who don't want to install a Flash plugin and don't want to be constantly prompted to do so.
Setting it to yes will cause all Flash objects to be replaced with transparent GIFs.
(You can't use rules to achieve the same effect because a Flash advert is normally replaced with a blank Flash object that loads the original into itself when you click on it.)
.br
.I use_proxy = yes | no
.br
.I proxy_host = host
.br
.I proxy_port = port
.br
When use_proxy is set to yes, you may specify a proxy for bfilter to forward requests to.
.br
.I no_proxy_for = host, host, host
.br
When use_proxy is set to yes, you may specify some hosts to be contacted directly.
The separator may be either a comma or a semicolon.
If a host starts or ends with a dot, it is assumed that any prefix or suffix can be appended to it (for example "no_proxy_for = .mydomain.com, 192.168.").
Note however that .mydomain.com won't cover mydomain.com itself but only its subdomains.
(When matching no_proxy_for hosts, no DNS queries are made. That means 127.0.0.1 won't act as localhost or the other way around.)
.HP
.B /etc/bfilter/rules
.br
.I filter=0|1
.br
Enable filtering.
.br
0: Serve the page as is
.br
1: (Default) Check for ads and apply the appropriate transformations
.br
.I ad=0|1|2
.br
Advert detection options.
.br
0: (Default) Standard procedure for is_ad decision
.br
1: Force negative is_ad decision
.br
2: Force positive is_ad decision
.br
.I scripts=0|1|2|3|4|5|6|7
.br
Javascript filtering options.
The default value of 3 is effective against js-generated ads, but breaks some sites that depend too heavily on Javascript.
Fortunately, the built-in Javascript engine mostly solves this problem.
.br
0: Leave as is
.br
1: Remove 3rd party scripts except in header
.br
2: Remove 3rd party scripts from everywhere
.br
3: (Default) Only allow scripts in header and those 1st party scripts that don't contain ".write"
.br
4: Only allow scripts in header and those 1st party scripts that contain "function "
.br
5: Only allow scripts in header
.br
6: Only allow 1st party scripts and only in header
.br
7: Remove all scripts
.br
.I jsengine=0|1
.br
Enable Javascript engine.
When the Javascript engine is used, the scripts parameter is ignored.
The output of a script (generated by document.write or writeln) is directed to the standard advert detector.
If it detects an advert, the script gets removed.
.br
0: Don't use
.br
1: (Default) Use if possible
.br
.I target_blank=0|1
.br
New-window attribute option for links.
A link may be marked to be opened in a new window if target="_blank" is specified as an attribute of an <a> tag.
.br
0: (Default) Leave as is
.br
1: Remove attribute
.br
.I [regex]
.br
For applying specific options to specific sites. Used after defaults have been set up. See the
.B RULES
section for further information.
.br
.HP
.B /etc/bfilter/rules.local
.br
For local rules and redefining the global parameters. Uses the same syntax as the global rules file.
.SH RULES
Rules are used for blocking ads which aren't automatically detected and/or for dealing with false positives. The rule format is:
.P
[regex]
.br
param1=val1
.br
param2=val2
.P
The regex gets converted to "^http://"+regex+"$" and uses the POSIX extended syntax.
For those inexperienced with regular expressions, a few explanations:
.P
.B .
means any character
.br
.B \e.
means the "." character
.br
.B \e?
means the "?"
character
.br
.B .*
means any number of any characters including none
.br
.B (this|that)
means "this" or "that"
.br
.B (something)?
means "something" or nothing
.P
You may use any of the global parameters such as filter, ad, scripts or jsengine in rules.
The parameters you don't specify are implicitly set to the corresponding default value.
.P
It is possible to have several rules match a single URL.
In this case the lowest values for each parameter are used.
That is, the values for different parameters may be taken from different rules.
.SH RULES RELATIONSHIP
.B Question:
What is the relationship between rules and rules.local files?
Do records in rules.local override the ones in rules or supplement them?
.br
.B Answer:
It's a rather complex relationship which will be shown in the following example.
.HP
Suppose the rules file looks like this:
.br
filter=1
.br
jsengine=1
.br
# Other parameters are omitted
.br
[regex1]
.br
filter=0
.HP
And the rules.local file looks like this:
.br
jsengine=0
.br
[regex2]
.br
filter=0
.P
First of all, the default
.I filter=1
parameter from rules is also implicitly present in rules.local as it's not overridden there.
Then, although only one parameter is associated with each regex in this example, all of the other parameters are also implicitly associated with them and their values are taken from the defaults of the corresponding file.
So in reality the [regex1] record also contains
.I jsengine=1
and the [regex2] record also contains
.IR jsengine=0 .
.P
Now suppose we want to get the jsengine parameter for a URL that matches regex1.
First we look for a matching regex in rules.local.
Having found none, we continue to look in rules, where we find the [regex1] record that matches the given URL.
This record has an implicit
.I jsengine=1
parameter, which is what we were looking for.
If our URL doesn't match any of the regexes, we take the default parameter from rules.local, which is
.IR jsengine=0 .
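.P
The lookup order described above can be sketched in Python.
This is only an illustration under stated assumptions, not bfilter's actual implementation: the function names, the data layout and the example hosts are hypothetical, Python's re module stands in for POSIX extended regular expressions, and taking the lowest value when several records in the same file match is an assumed way of combining this section with the "lowest values" rule from the RULES section.
.P
.nf
```python
import re

# Hypothetical in-memory form of the two example files: (defaults, records).
RULES = ({"filter": 1, "jsengine": 1}, [("one[.]example/.*", {"filter": 0})])
RULES_LOCAL = ({"jsengine": 0}, [("two[.]example/.*", {"filter": 0})])


def matches(regex, url):
    # The man page states the regex is wrapped as "^http://" + regex + "$".
    return re.search("^http://" + regex + "$", url) is not None


def lookup(param, url, rules=RULES, rules_local=RULES_LOCAL):
    global_defaults, global_records = rules
    # Defaults not overridden in rules.local are inherited from rules.
    local_defaults = {**global_defaults, **rules_local[0]}
    # rules.local is searched first, then rules.
    for defaults, records in ((local_defaults, rules_local[1]),
                              (global_defaults, global_records)):
        # Each record implicitly carries its own file's defaults.
        hits = [{**defaults, **params}[param]
                for regex, params in records if matches(regex, url)]
        if hits:
            return min(hits)  # assumption: lowest value wins within a file
    # No regex matched anywhere: fall back to the rules.local defaults.
    return local_defaults[param]


print(lookup("jsengine", "http://one.example/page"))
print(lookup("jsengine", "http://other.example/"))
```
.fi
.P
With the two example files above, a URL matching only [regex1] gets jsengine=1 from rules, while a URL matching neither regex gets the rules.local default jsengine=0.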
.SH EXAMPLES
.B 1)
All images whose hostnames or paths look like standard advert hostnames or paths are classified as adverts and filtered.
.P
[(.*/)?banners?(/|\\.).*]
.br
ad=2
.br
[(.*/)?ad[sv]?(/|\\.).*]
.br
ad=2
.br
[(.*\\.)?ad[0-9]*\\..*]
.br
ad=2
.P
.B 2)
Allow images from the distributed content provider Akamai.
.P
[.*\\.akamai.net/.*]
.br
ad=1
.P
.B 3)
Disable the Javascript engine for the Hitweb tracker and use the scripts setting for filtering instead.
.P
[(www\\.)?hitweb\\.info/Download\\.asp\\?\/.*]
.br
jsengine=0
.P
.B 4)
Allow images used to count page views for projects hosted on SourceForge.
.P
[(www\\.)?sourceforge.net/sflogo.php\\?.*]
.br
ad=1
.SH CONTROLLING
Restart bfilter to reload configuration files.
.P
Sending a
.B SIGUSR1
to all bfilter processes will cause the child processes only to exit after handling their last request.
.SH NOTES
If the HTML processor is in doubt about an image or a Flash file, it defers the decision until the browser has requested that file.
The response is then analyzed (redirects, cookies) as well as the file itself.
For an image, the analyzer checks its dimensions and whether it's animated or not.
For Flash files, the analyzer tries to find a button that covers most of the object's area and has a getURL action associated with it.
Depending on the results, the object is either forwarded to the client, or substituted with a generated replacement.
(Unfortunately, analyzing objects that are placed with Javascript doesn't work, as their URLs in Javascript source cannot be altered.)
.SH BUGS
Please report any bugs you may find to:
.P
.B http://sourceforge.net/projects/bfilter
.SH AUTHOR
Joseph Artsimovich
.br
http://bfilter.sourceforge.net
.SH SEE ALSO
.BR regex (7)
.br
.I http://mozilla.org/js/spidermonkey/
.br
.I http://www.iki.fi/vl/tre/