
The mod_rewrite module meets all the important requirements for flexibly manipulating Uniform Resource Locators (URLs). It can be used with the Apache HTTP daemon from version 1.1 onwards, and since release 1.2 it has been part of the distribution. mod_rewrite implements a rewriting engine based on regular expressions. Six examples show the different guises in which the module proves useful. They range from simply renaming a page that should stay reachable under its old address for a while to combining several Web servers into a cluster.
The basic idea of the additional Apache module is to separate the physical view of the Web server's file system from the logical addressing with URLs, by mapping consistently and intuitively formulated URLs to the right path names. For one thing, such rewriting is meant to avoid unwieldy URLs that mirror a long path name in the file system. For another, it works around the disadvantages of absolute references. In software archives, for example, you find complex references such as <a href="/cgi-bin/targz?/~quux/foo/arc/bar.tar.gz">bar.tar.gz</a> for showing an archive's contents. If the position of the script or the archive file changes, for example because the file system is restructured, the hyperlinks have to be adapted. That's possible within the Web server's local area, but all other hyperlinks (for example, in users' bookmarks or on Web pages at other servers) remain unchanged and so point to the old location.
It would therefore be much better if the hyperlink contained just the file to be called up, and a syntactic supplement signalled to the Web server that it should supply the contents of exactly this file. In principle, such a supplement can be freely chosen. For example, the RewriteRule in listing 1 replaces a reference to files ending in ".tar.gz" with the execution of a CGI script, provided that a slash, fixed by convention, follows the file name. The reference above can then be written in a considerably simpler and more maintainable form: <a href="bar.tar.gz/">bar.tar.gz</a>. Previously, Web servers offered only a rudimentary basis for such mapping. Because a RewriteRule uses regular expressions to define how Apache reformulates the URL, mod_rewrite gives the Web server considerably more flexibility.
The first directive in listing 1 is absolutely necessary, because it switches on the rewriting engine; without it nothing happens. RewriteBase then sets the URL of the current directory. (Neither directive is mentioned explicitly in the later examples.) The mod_rewrite module is configured with these and possibly further directives in the Web server's configuration file. (Also see The most important mod_rewrite directives.)
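As an illustration of the directives just described, a configuration of this kind might look roughly like the following sketch. It is not necessarily identical to listing 1. It assumes that the .htaccess file lives in the directory served as /~quux/ and that the CGI script is reachable as /cgi-bin/targz, as in the hyperlink quoted earlier.

```apache
# Switch on the rewriting engine; without this nothing happens
RewriteEngine on

# URL of the directory this .htaccess file belongs to (assumed)
RewriteBase   /~quux/

# A ".tar.gz" file name followed by a slash is handed to the CGI script
# (script location assumed from the example hyperlink)
RewriteRule   ^(.+\.tar\.gz)/$  /cgi-bin/targz?/~quux/$1
```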
In principle, the way the module works is very simple. mod_rewrite hooks into Apache at compile time via the API (Application Program Interface). At runtime, the Apache kernel calls the module for every incoming URL. If necessary, mod_rewrite manipulates this URL and passes it back to the Apache kernel for further processing. The manipulation can have different results. One possibility is a valid path name, which leads directly to the requested Web page being delivered. The alternative is one of three events in the kernel:
What the result of rewriting actually is depends on the rules used. A rule always consists of a regular expression, which is applied to the URL (comparison), and a result string, which replaces the current URL if the comparison succeeds (substitution). The rewriting engine goes through the list of configured rules sequentially and applies each to the current value of the URL, until either a rule ends the rewriting prematurely or the end of the list is reached. The page Rules determine the URL manipulation explains this mechanism with an example.
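The sequential processing can be pictured with a small sketch for the server-global context. All paths in it are hypothetical and serve only to show how each rule works on the current value of the URL and how the L(ast) flag ends the rewriting prematurely.

```apache
# Server-global context; all paths are hypothetical
RewriteEngine on

# Rule 1: /old/x.html becomes /new/x.html; processing continues
RewriteRule ^/old/(.*)$      /new/$1

# Rule 2: applied to the current value, so /new/x.html
# becomes /current/x.html; [L] ends the rewriting here
RewriteRule ^/new/(.*)$      /current/$1  [L]

# Rule 3: not reached for requests that matched rule 2
RewriteRule ^/current/(.*)$  /fallback/$1
```

A request for /old/x.html therefore ends up as /current/x.html, whereas a direct request for /current/x.html skips the first two rules and is rewritten by rule 3.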
Along with the server-global context, which covers all the files administered by the Web master, Apache has a directory context, which is set up by a file called .htaccess (see [[#lit 1]]). The configuration directives entered there take effect only on the files in and underneath the directory containing .htaccess.
Some of the examples work only in the server-global context, but most are formulated for the directory context. The starting situation is as follows: .htaccess is located on Web server www.quux-corp.com in the physical directory /home/quux/.www/, reachable with URL /~quux/.
A frequent use of URL manipulation is when files or directories are renamed but the pages should remain reachable under the old name for a certain time. A directive such as the following is used for this:
RewriteRule ^foo\.html$ bar.html
This causes all HTTP requests for the URL /~quux/foo.html to be internally mapped to /home/quux/.www/bar.html. The user sees the contents of the file bar.html, which the browser still knows as /~quux/foo.html.
Alternatively, the rewritten URL can be made visible to the user. The page shown remains the same, but the URL shown by the browser is now /~quux/bar.html. Such an HTTP redirect was previously possible with the Apache server's srm.conf file, though not for local links or by comparison with regular expressions.
In this second case the Web server doesn't deliver the content straight away, but refers the Web browser to the new URL, which the browser then requests immediately. The advantage of this method is that a user who bookmarks the page already has the new address. To achieve this, you append the R(edirect) flag to the RewriteRule. That causes the Web server to send the new URL back to the browser.
RewriteRule ^foo\.html$ bar.html [R]
A further application that is often wanted is to deliver a different page depending on the Web browser making the request. The browser makers' "feature of the week" syndrome and the HTML standards committees lagging behind them make such differentiation interesting. That way, you can provide pages with the latest HTML tags for certain browsers and at the same time offer standard-conforming pages to the others. Previously it was necessary to write a script that was called in place of the default page. With the additional module, the RewriteRules in listing 2 cause the Web server to look for the page matching the Web browser.
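A browser-dependent selection of this kind could be sketched as follows for the directory context. The file names foo.NS.html, foo.20.html and foo.32.html and the User-Agent patterns are assumptions chosen for illustration and need not match listing 2.

```apache
# Netscape Navigator 3.x gets the page with the latest tags
RewriteCond %{HTTP_USER_AGENT}  ^Mozilla/3
RewriteRule ^foo\.html$         foo.NS.html  [L]

# Lynx and older Mozilla versions get a plain variant
RewriteCond %{HTTP_USER_AGENT}  ^Lynx/  [OR]
RewriteCond %{HTTP_USER_AGENT}  ^Mozilla/[12]
RewriteRule ^foo\.html$         foo.20.html  [L]

# All other browsers get the standard-conforming page
RewriteRule ^foo\.html$         foo.32.html  [L]
```

Each RewriteCond applies only to the RewriteRule that follows it, which is why the last rule needs no condition and acts as the default.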
A combination of RewriteCond and RewriteRule directives can be used to keep Web robots away from certain Web pages. Together with the F(orbidden) flag, the RewriteCond directive blocks the robots' HTTP requests and so helps to keep down the server load they cause, for example with dynamically generated pages.
To track down a "poaching" robot which, for example, follows all the links in a large, dynamically generated archive, take a look at the Apache server's log files. If almost all the hyperlinks have been followed at short intervals, that points to such a robot. If the server has been configured to additionally record the HTTP header "User-Agent", the robot is even easier to recognise (see the documentation for the Apache module mod_log_config). Conventionally, robots are excluded using /robots.txt in the Web server's file tree, but "poaching" robots are precisely the ones that don't conform to this convention.
Listing 3 shows a mod_rewrite method for keeping off such a robot, called "NameOfBadRobot". First of all, it's identified by its IP address. Additionally checking its name is meant to ensure that the server doesn't block normal users who may be working from the same computer. Reading the starting page remains allowed, so that the robot can read this page and enter it in its index.
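The blocking just described might be sketched like this for the server-global context. The address range 192.0.2.8 to 192.0.2.9 and the archive path /~quux/foo/arc/ are placeholders, not values from listing 3.

```apache
# Both conditions must hold: the robot's name ...
RewriteCond %{HTTP_USER_AGENT}  ^NameOfBadRobot
# ... and its (placeholder) IP addresses
RewriteCond %{REMOTE_ADDR}      ^192\.0\.2\.[8-9]$

# Forbid everything below the archive directory. The pattern
# requires at least one character after the slash, so the
# starting page /~quux/foo/arc/ itself stays readable.
RewriteRule ^/~quux/foo/arc/.+  -  [F]
```

The "-" as substitution leaves the URL untouched; only the F flag takes effect.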
Proxy caches have been around for a long time to speed up access to frequently used pages. They store copies of the pages locally (see [[#lit 2]]).
As an alternative to a proxy working in the background, a dynamic mirror can be set up explicitly in the Web server's URL name space. Such a mirror is recommended for large amounts of data that are frequently requested locally but change regularly. An example of this is the HOTSHEET homepage (http://www.tstimpreso.com/hotsheet/). In this case it makes sense to provide a dynamic copy on the Intranet which is updated to the status of the original on demand:
RewriteRule ^hotsheet/(.*)$ http://www.tstimpreso.com/hotsheet/$1 [P]
In the same way, a dynamic mirror can be set up for any document:
RewriteRule ^usa-news\.html$ http://www.quux-corp.com/news/index.html [P]
The RewriteRule's P(roxy) flag is used here, which passes the resulting URL to the internal Apache proxy module for further processing. The proxy module reads the Web data, writes it to Apache's permanent cache and delivers it to the Web browser as if it were a local file. If the browser requests the URL again, it gets the Web data from the cache.
mod_rewrite is also useful if a company wants to offer on its Web server (www.quux-corp.com) not just the official company information but also information which is actually only available on the Intranet (www-intranet.quux-corp.com), such as employees' homepages. For this, all the data not physically available on the Internet Web server has to be fetched automatically from the Intranet. Of course, this only applies to files that are allowed to be read in this way. The following convention applies in the Intranet: the home directory of employee "quux" is addressable with URL /~quux/ over both the Internet and the Intranet. On the Intranet, however, the URL maps to the directory /home/quux/.www/, whilst the Internet URL references /home/quux/.www/pub/, which holds the files that should be visible outside the firewall. The advantage is obvious: the employee can keep both data collections on the Intranet and maintain them there, and can also try them out over the Intranet server www-intranet.quux-corp.com under URLs /~quux/ and /~quux/pub/.
The difficulty is that the Intranet files have to be fetched safely through the Internet Web server. For this you normally configure a rule on the firewall as shown in listing 4. This can be done in a similar way with almost all types of firewall. It makes sure that only the Web server www.quux-corp.com can fetch files from the Intranet via HTTP. A mod_rewrite configuration on www.quux-corp.com then makes sure that this Web server fetches the necessary files on request from the Intranet pub sub-directory (see listing 5).
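For a single user, fetching missing files from the Intranet could be sketched as follows. This is a simplified illustration in the directory context, not listing 5. It assumes an .htaccess file in the directory that www.quux-corp.com serves as /~quux/, a version of mod_rewrite that supports the -f and -d file tests, and an available proxy module.

```apache
RewriteEngine on
RewriteBase   /~quux/

# Only act if the request matches neither a local file ...
RewriteCond   %{REQUEST_FILENAME}  !-f
# ... nor a local directory
RewriteCond   %{REQUEST_FILENAME}  !-d

# Fetch the file from the pub sub-directory on the Intranet server
RewriteRule   ^(.*)$  http://www-intranet.quux-corp.com/~quux/pub/$1  [P]
```

A server-wide variant would use a pattern that captures the user name instead of naming "quux" explicitly.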
One of the most interesting application areas of URL manipulation was the reason for developing mod_rewrite. The following example is based on a problem the company sd&m wanted to solve. The starting situation was that the data to be made available via the WWW was located on various Intranet servers, and the company didn't want to make the Intranet more complex with additional NFS mounts. A Web server can only provide the data that it can reach through the file system.
If you don't want a central Web server which needs lots of disk space, the only option left is to install a further Web server on every machine with user pages. The data can then be reached, and the Web load in the Intranet is distributed over various machines. However, whatever indices you may use, you have lots of separate Web servers and no uniform view of the distributed resources. It would be desirable to bring the individual servers together into a so-called Web cluster and so create a common name space for URLs, through which you can access the data.
At the same time you make sure that the convention for the path names of the home directories is reflected in the URL. In the example here, all the home directories can be reached with /u/user, /g/group and /e/entity. For security reasons the Web files are stored one level below, in .www. The root directory of user "quux" is therefore available on the respective server under /u/quux/.www/ and should always be found with URL /u/quux/ in the cluster.
To realise this, the Web servers belonging to the cluster are first allocated aliases swwN.sdm.de in the domain name service (DNS), where "N" is a consecutive number. When any server is accessed under http://swwN.sdm.de/u/quux, a redirection to the physical server should take place. If user "quux" has his home directory on sww2.sdm.de, then server sww1.sdm.de has to know this too, so that it can send the new URL http://sww2.sdm.de/u/quux/ back to the browser in an HTTP redirect. The server sww2.sdm.de, on the other hand, has to recognise that quux is reachable locally and deliver the data directly.
For speed reasons the servers draw their knowledge from a previously created table, which is distributed to the Web servers. Every hour, jobs on the individual servers transmit the information about the locally available home directories via HTTP to a dedicated server in the Web cluster. In return they get the other servers' information from this dedicated server to update their own table. (The official mod_rewrite distribution contains the Perl program necessary for this.) These tables have the following form:
foo   sww3.sdm.de
bar   sww1.sdm.de
quux  sww2.sdm.de
baz   sww4.sdm.de
:     :
The RewriteMap mechanism is used to link these tables into the server-wide Apache configuration on the individual Web servers. It assigns every table a unique name by which it can be queried in the substitution strings of the RewriteRule directive:
${Name_of_the_table:Request_key|Default_value}
Considering that, in the case of a self-reference, mod_rewrite reduces a URL like http://host:port/path to /path, the solution for the Web server swwN.sdm.de is almost obvious.
For every URL underneath /u/, /g/ and /e/, listing 6 determines the physical Web server from the associated table using the name of the user, the group or the entity. It then prefixes the current URL with "http://physical_host". If "physical_host" is an alias for the local server, mod_rewrite removes the prefix again immediately. Otherwise, this fully qualified URL leads to an HTTP redirect, which refers the browser to the new Web server "physical_host".
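A configuration along these lines might look roughly like the following sketch for the server-global context. The map names, the file locations under /path/to/ and the default host sww1.sdm.de are assumptions for illustration; listing 6 may differ in detail.

```apache
RewriteEngine on

# Tables distributed to every server in the cluster (locations assumed)
RewriteMap  user-to-host    txt:/path/to/map.user-to-host
RewriteMap  group-to-host   txt:/path/to/map.group-to-host
RewriteMap  entity-to-host  txt:/path/to/map.entity-to-host

# Prefix the URL with the physical host looked up in the table;
# the host after "|" is the default if the key is not found
RewriteRule ^/u/([^/]+)/?(.*)  http://${user-to-host:$1|sww1.sdm.de}/u/$1/$2
RewriteRule ^/g/([^/]+)/?(.*)  http://${group-to-host:$1|sww1.sdm.de}/g/$1/$2
RewriteRule ^/e/([^/]+)/?(.*)  http://${entity-to-host:$1|sww1.sdm.de}/e/$1/$2

# Local requests only: map the URL to the .www sub-directory
RewriteRule ^/([uge])/([^/]+)/?$         /$1/$2/.www/
RewriteRule ^/([uge])/([^/]+)/([^.]+.+)  /$1/$2/.www/$3
```

The last two rules can only match when the prefix has been removed again, that is, when the home directory is on the local server.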
The last application example shows a way out for all the problem situations that mod_rewrite doesn't yet support. Here an external program takes over the URL manipulation, functioning as a sort of dynamic rewriting map. To prevent a user from sending Apache into an endless loop, such programs can only be declared in the server configuration and not in the users' .htaccess files.
Suppose that user "quux" needs very complex URL manipulation for his home directory /~quux/. The Web master can set up a script especially for this user, as shown in listing 7 (provided he has checked the program beforehand).
These directives have the effect that the next time Apache is started, map.quux.pl is started too and runs in parallel. Apache sends all URLs underneath /~quux/ to this program on its standard input and takes the result returned on the standard output as the new URL. Any program that can be executed under Unix can take the place of the Perl script. Listing 8 shows a simple example of map.quux.pl. In connection with the configuration above, it has the effect that, when Netscape Navigator is used, the URL /~quux/foo/index.html is rewritten into /~quux/bar/index.html.
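The declaration of such a program map could be sketched as follows for the server configuration. The map name quux-map and the script location under /path/to/ are assumptions, and the sketch leaves out any browser-specific condition that listing 7 may contain.

```apache
# Start the external program together with Apache and
# make it available under the (assumed) name quux-map
RewriteMap   quux-map  prg:/path/to/map.quux.pl

# Send the part of the URL below /~quux/ to the program
# and insert the line it returns into the new URL
RewriteRule  ^/~quux/(.*)$  /~quux/${quux-map:$1}
```

The program is expected to read one lookup key per line from standard input and to answer each with exactly one line on standard output.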
In spite of all the advantages that mod_rewrite offers, it shouldn't go unmentioned that flexible URL manipulation has a price, because the numerous comparisons with regular expressions are CPU intensive. However, practice has shown that up to 50 global RewriteRules have no visible influence on the Web server's performance on an average SPARCstation 10/61.
You should also not underestimate how easily the wrong use of rules can lead to obscure result URLs. To look into the cause of such problems, every URL change can be examined in mod_rewrite's log file. There is also a collection of example solutions at http://www.engelschall.com/sw/mod_rewrite/solutions/.
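Switching on the log file might look like the following sketch, assuming a server version that supports the RewriteLog and RewriteLogLevel directives (Apache 2.4 and later configure rewrite logging through the LogLevel directive instead). The log file location is hypothetical.

```apache
# Write a trace of the rewriting steps to a separate file
RewriteLog       /path/to/rewrite.log

# 0 switches logging off; higher values give more detail
RewriteLogLevel  2
```

Because detailed logging slows the server down, a high level is best reserved for debugging.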
Ralf S. Engelschall
is a student of Computer Science at the Technical University of Munich and has been responsible for Unix systems and Internet services at sd&m GmbH and Co KG for four years. Email rse@engelschall.com
Christian Reiber
is an IT graduate employed as a system engineer at Zeppelin Baumaschinen GmbH. Email chrei@en.muc.de
Literature
[1] Henning Behne; Web Indian; Configuring the Apache HTTP daemon; iX 6/96, pp. 122 ff.
[2] Rainer Klute; Zwischenstation; Mit dem Proxy-Server Zeit und Geld sparen; iX 2/95, pp. 154 ff.
This text is taken from issue 12/1996 of the magazine iX.