Attack of the killer Squid
Squid is the deadliest Web cache in the entire world. Mike MacCana pokes a stick in its eye.
I've recently spent a couple of weeks on site at a client running a Linux training course and dabbling with the popular and powerful Web-caching package, Squid. During the training course it's often useful to provide students with Web access to check their email, news sites, etc. during breaks, and to look up some of the online resources that may have been mentioned in the notes.
Like most corporate environments, the training network uses a Web proxy server to provide desktop machines with Web access. A Web proxy cache is simply a server that acts on behalf of Web browsers, fetching information from Websites (technically called origin servers in caching parlance) and returning it to them.
More to the point, it also caches that information so when another user comes along to request the same object, the server can simply return the saved version from the cache rather than pulling down that file from the Internet again. In this way, the cache makes for seemingly faster Web browsing and saves on bandwidth costs.
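To make the hit-or-miss idea concrete, here's a minimal sketch in Python. It is a toy, not how Squid stores objects: the class name, the in-memory dictionary and the fake_origin function are all my own stand-ins, and a real cache also has to deal with expiry times, cache-control headers and running out of disk.

```python
from typing import Callable, Dict


class TinyWebCache:
    """Toy in-memory cache keyed by URL (an illustration, not Squid's design)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        # Cache hit: return the saved copy without touching the Internet.
        if url in self._store:
            self.hits += 1
            return self._store[url]
        # Cache miss: fetch from the origin server, then keep a copy.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body


def fake_origin(url: str) -> bytes:
    # Hypothetical stand-in for a real HTTP fetch so the sketch runs offline.
    return f"<html>content of {url}</html>".encode()


cache = TinyWebCache(fake_origin)
cache.get("http://www.example.com/")  # first user: miss, fetched from origin
cache.get("http://www.example.com/")  # second user: hit, served from cache
print(cache.hits, cache.misses)  # prints: 1 1
```

The second request never reaches the origin server, which is where both the speed-up and the bandwidth saving come from.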
There are a few other reasons why Web caches, and Squid, are popular:
• Setting up a Web cache (and not fully sharing out an Internet connection using NAT) can help curb the use of unauthorized network apps (such as peer-to-peer file trading apps) by limiting traffic to Websites. However, it's often possible to tunnel other protocols out through a Web proxy over SSL using tools such as Transconnect.
• Monitoring staff for violations of acceptable use policies. As all Web access occurs through Squid, it provides a convenient choke point for your logging.
A wide variety of analysis tools work with Squid's log file format to produce exact details of who's browsing what. In our training room, I've never had any issues, but another instructor has found a student 'making the most of' the company's broadband connection during a course. In case that happens again, we log exactly who's looking at (or doing) what in the training environment.
• It's easy to filter content that goes against company HR/security policy using the various open source and proprietary Web-filtering Squid plugins. One of the most popular of these is DansGuardian. Most other filters just use a simple blacklist of banned sites (that's never quite as updated as one would like) or look for objectionable URLs (which tends to miss quite a lot of the content you'd like to filter). DansGuardian also filters based on phrases found within the actual page content.
• Some ISPs (eg. Telstra) also allow peering with their own parent cache. In this situation, if a Web browser requests a piece of information not in the company cache, that cache will attempt to fetch that object from the ISP's cache before fetching it from the origin server (the real Website). As this saves the ISP bandwidth, many such ISPs charge a discounted rate for downloading information that has already been cached.
• Using a Web cache eases the load on the origin servers, making your organisation a good Net citizen.
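The peering arrangement in the list above boils down to a lookup order: company cache first, then the ISP's parent cache, then the origin server. Here's a rough Python sketch of that order. The function and variable names are hypothetical, both caches are plain dictionaries, and it glosses over how Squid actually talks to a peer.

```python
from typing import Callable, Dict, Tuple


def fetch_via_hierarchy(
    url: str,
    local_cache: Dict[str, bytes],
    parent_cache: Dict[str, bytes],
    origin_fetch: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (body, source), where source is 'local', 'parent' or 'origin'."""
    # 1. The company cache already has it: nothing leaves the building.
    if url in local_cache:
        return local_cache[url], "local"
    # 2. Ask the ISP's parent cache before going to the real Website.
    if url in parent_cache:
        body, source = parent_cache[url], "parent"
    else:
        # 3. Nobody has it cached: fetch from the origin server.
        body, source = origin_fetch(url), "origin"
    # Either way, keep a local copy for the next user.
    local_cache[url] = body
    return body, source


# Made-up data for illustration only.
company = {}
isp = {"http://www.example.com/": b"cached at ISP"}
fetch = lambda url: b"fresh from origin"

print(fetch_via_hierarchy("http://www.example.com/", company, isp, fetch)[1])
print(fetch_via_hierarchy("http://www.example.com/", company, isp, fetch)[1])
print(fetch_via_hierarchy("http://www.example.org/", company, isp, fetch)[1])
```

Run as written, the three requests are served from the parent, the local cache and the origin respectively.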
Squid is the most popular Web cache in the world, and has around 70-85% market share depending on who you talk to. It's an open source app, packages come with most Linux distributions, and setting it up is pretty simple. I used Red Hat 8 for the training lab, but the techniques below could just as easily be used on the majority of other Linux distributions.
First, open the terminal, switch to root and install the Squid package:
su -
root password
rpm -Uvh <squid package.rpm>
Squid, like all good server apps, is shipped in a locked down configuration by default – allowing cache access only from localhost (ie. the machine running the Squid server). So before starting the service, we need to create rules in Squid's access control allowing the rest of the network to access the cache. I could have used a GUI admin tool like Webmin (www.webmin.com) to configure the app, but I modified the configuration file directly because I'm an old school Unix man, despite not having a really big beard. I opened /etc/squid/squid.conf and scrolled down to the words 'ACCESS CONTROL'. I then added the following two lines:
acl desktops src 192.168.5.0/255.255.255.0
http_access allow desktops
The first line defines a group called 'desktops' which is made up of users requesting to use the cache from the network 192.168.5.0 with netmask 255.255.255.0 (the same thing as '192.168.5.0/24'). The name 'desktops' is arbitrary, and up to the administrator, but should describe the group. Modify the network address definition as appropriate. The next line grants this group access to the proxy. As the rules are applied in order, this line must be added before the existing 'http_access deny all' line in the config file. I saved the file and started the squid service with:
service squid start
'service', by the way, is a Red Hat shortcut that does the same thing as typing /etc/init.d/squid start on any other distribution of Linux.
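Why the order of the http_access lines matters is easier to see in code. The following Python sketch mimics first-match rule evaluation using the standard ipaddress module. It illustrates the idea only and is not Squid's own logic: the ACLS and RULES structures and the is_allowed function are hypothetical, and I've assumed that a client matching no rule is denied.

```python
import ipaddress

# Named groups, mirroring the 'acl' lines (the 'all' group is my assumption).
ACLS = {
    "desktops": ipaddress.ip_network("192.168.5.0/255.255.255.0"),
    "all": ipaddress.ip_network("0.0.0.0/0"),
}

# Rules in config-file order, mirroring the 'http_access' lines.
RULES = [
    ("allow", "desktops"),
    ("deny", "all"),
]


def is_allowed(client_ip: str, rules=RULES) -> bool:
    """Apply the rules top to bottom; the first matching rule decides."""
    address = ipaddress.ip_address(client_ip)
    for action, acl_name in rules:
        if address in ACLS[acl_name]:
            return action == "allow"
    return False  # assumption for this sketch: no match means no access


print(is_allowed("192.168.5.20"))  # a training desktop
print(is_allowed("10.1.2.3"))      # a machine outside the training network

# With the two rules swapped, 'deny all' matches first and everyone is refused.
print(is_allowed("192.168.5.20", rules=list(reversed(RULES))))
```

The last call shows the trap: put the allow line after 'deny all' and the desktops never get a look in.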
I configured the training desktops to use the proxy (ie. the hostname or IP of the server machine, port 3128) and tested access.
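Browsers have their own proxy settings dialogs, but scripts need telling too. As a minimal sketch, this is how a Python script could be pointed at the cache using the standard urllib module. The address 192.168.5.1 is a made-up example for the server machine, and the helper function names are mine.

```python
import urllib.request


def proxy_settings(host: str, port: int = 3128) -> dict:
    """Build the proxy mapping urllib expects; 3128 is the port used above."""
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def build_proxy_opener(host: str, port: int = 3128) -> urllib.request.OpenerDirector:
    # ProxyHandler routes requests for the listed schemes through the proxy.
    handler = urllib.request.ProxyHandler(proxy_settings(host, port))
    return urllib.request.build_opener(handler)


# Hypothetical server address; substitute the hostname or IP of your cache.
opener = build_proxy_opener("192.168.5.1")
print(proxy_settings("192.168.5.1")["http"])  # prints: http://192.168.5.1:3128
# opener.open("http://www.example.com/") would now be sent via the proxy.
```

Nothing is fetched until opener.open() is called, so the sketch is safe to run without a proxy in place.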
Now, install the 'webalizer' package (it also comes with Red Hat) and run:
webalizer -F squid /var/log/squid/access.log
This will produce a neatly formatted HTML report in /var/www/usage/index.html, detailing the most visited sites, peak times of the day, comparative bandwidth used for previous months, and a bunch of other useful statistics so you can see exactly what's going on in your internal network.
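If you're curious what a tool like this does under the hood, the following Python sketch counts requests and bytes per site from Squid's native access.log format (timestamp, elapsed time, client, result code, size, method, URL and so on). It's a toy, not Webalizer's method: the summarise function is my own and the sample log lines are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarise(lines):
    """Return (requests per host, bytes per host) from native-format log lines."""
    hits, traffic = Counter(), Counter()
    for line in lines:
        fields = line.split()
        # Skip anything that doesn't look like a complete log entry.
        if len(fields) < 7 or not fields[4].isdigit():
            continue
        size, url = int(fields[4]), fields[6]
        host = urlsplit(url).hostname or url
        hits[host] += 1
        traffic[host] += size
    return hits, traffic


# Made-up sample entries, not taken from a real log.
SAMPLE = [
    "1046700000.101    120 192.168.5.11 TCP_MISS/200 5120 GET http://www.example.com/ - DIRECT/192.0.2.10 text/html",
    "1046700001.202     15 192.168.5.12 TCP_HIT/200 5120 GET http://www.example.com/ - NONE/- text/html",
    "1046700002.303    340 192.168.5.11 TCP_MISS/200 2048 GET http://news.example.org/today - DIRECT/192.0.2.20 text/html",
]

hits, traffic = summarise(SAMPLE)
print(hits.most_common(1))         # prints: [('www.example.com', 2)]
print(traffic["www.example.com"])  # prints: 10240
```

To try it on a real log, pass open('/var/log/squid/access.log') instead of SAMPLE.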
If the Hat Fits...
Alongside 'Why Linux?' (which I'll tackle another time), the most common question from both clients and readers is which distribution I prefer. At this point, most Linux columnists would probably provide some kind of detailed discussion on the relative merits of various Linux flavors, but I'll cut to the chase and simply say: the latest Red Hat (8.0 at the time of writing, though 8.1 may be out by the time you read this).
I've used, and continue to use, most Linux distributions on a regular basis, and I've come to the conclusion that Red Hat simply produces a better quality distribution for most purposes, whether for businesses or enthusiasts, desktops or servers. It's accessible, with a reasonably friendly installer and finally, in 8.0, good quality graphical admin tools for most tasks. It encourages good security practices with a minimal set of network services running by default, and a nifty little tool that prompts users to install updates when they become available. All packages are cryptographically signed by the company, allowing users to easily identify potentially tampered software, and files on disk can be easily compared with their packaged versions to find out exactly what's changed.
The release cycle is frequent without being overwhelming for those who support it: each major release (6.x, 7.x, 8.x) maintains application compatibility across itself, with the point releases (eg. 7.3, 8.1) providing new features and updates of the included software.
The startup sequence is easy to understand, and it's easy to determine whether a given service is running. The distro includes a fairly wide range of software, including multiple implementations of FTP and SMTP servers, allowing experienced users to choose between older, more popular apps like Sendmail and WU-FTPD and newer, better designed equivalents like Postfix and vsftpd (Very Secure FTP).
Hardware and software support is