Hits of the Day
"Our server gets a hundred thousand accesses a day!" To be able to quantify such claims, you have to be able to show how you got this figure. At the moment, most of the tools available for doing that seem to be either unreliable or devious.<br />
Judging a server's popularity and acceptance is interesting in many ways: a thorough analysis can help the operator optimise the server's structure, advertising customers demand cost/usage estimates, and a system administrator gains an argument for additional hardware or a better Internet connection. But even though the desire for an analysis is understandable, the technical realisation is difficult, because there are no generally accepted yardsticks or measuring methods.
There are lots of programs for evaluating WWW server log files. Excellent examples are Analog and 3Dstats. Analog - a freeware program for Unix, VMS, MacOS and DOS - is very fast, extremely flexible and easy to configure. Its strengths are the various reports, each of which is configurable, and an integrated cache for results already ascertained from older log files, which allows analyses spanning several months with calculation times usually under five minutes. This tool can generate a summary; monthly, weekly, daily and hourly reports; and domain, host and directory reports. When using the NCSA/Apache daemons, there are also overviews for browsers, errors and links. The information can be presented in English (with British or American expressions), German or French. Analog has over 180 options, which is surely enough for most users' requirements. Stephen Turner, the author, also welcomes comments about it.
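To make concrete what such evaluations involve, here is a minimal sketch in Python - not Analog itself - that counts hits per day from an NCSA Common Log Format file; the file name access.log is illustrative:

import re
from collections import Counter

# Pattern for the NCSA Common Log Format, e.g.:
# host - - [23/Sep/1996:10:00:00 +0100] "GET / HTTP/1.0" 200 1234
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def hits_per_day(path):
    """Count log entries per day; the day field looks like 23/Sep/1996."""
    days = Counter()
    with open(path) as log:
        for line in log:
            match = CLF_PATTERN.match(line)
            if match:
                days[match.group("day")] += 1
    return days

for day, count in hits_per_day("access.log").most_common():
    print(day, count, "hits")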
3Dstats is graphically more ambitious: it presents its overviews as VRML worlds, and the user can zoom in and out of detailed views for the various months and weeks. But although they are visually appealing, these overviews are purely graphical and therefore less precise, so they are only suitable for analysing trends. In addition, you need a VRML viewer to look at them.
Meaningfulness of the data is questionable
Soon after the initial joy over the nice-looking graphics, the question arises of how meaningful they are. Can the results be drawn on to make a sensible judgement of user acceptance? That can't be answered without a more precise definition of the measured quantities and knowledge of the method of analysis.
The first and biggest factor of uncertainty in log file analysis is the existence of the various WWW caches (Harvest Squid, Apache, CERN and Netscape). These proxy servers provide time-saving temporary storage for pages that have been called up once (see [[#literature 1]], [[#literature 4]] and [[#literature 5]]); requests served from the cache then don't appear in the origin server's log files. Use of an institution's cache can be mandatory for various user groups, be they providers' customers, research network members or users with a direct Internet connection: a locally installed firewall then only allows Web browsers access to the outside world if they have a proxy cache entry configured. Depending on its size and special features - for example, integration in a joint cache system - each cache has a different hit rate.
Proxy caches not considered
The number of caches taking part in the German integrated cache system (http://www.informatik.uni-bonn.de/de-cache/) is currently 48 (23 September 1996), with a total hard disk capacity of 188 Gbytes, and the number of Harvest Squid caches in Germany is probably over 130. The DFN [German Research Network] is planning to install a further ten cache servers at the B-WiN central nodes; they will also be integrated into the cache system, where they will be afforded the highest status in the hierarchy. In addition, there is an unknown number of Apache, CERN and Netscape caches in use at various providers (like, for example, the T-Online cache with 1.2 million users). In view of all these cache servers, the analysis of WWW server log files can only be drawn on to a limited extent for assessing the quality of your own Web offering.
[Diagram 2: In a hierarchical cache system, as realised with the DE-Cache, objects are exchanged without being logged by the originating WWW daemon.]
A counterargument from the statistics advocates is that not all Web users make use of a proxy cache. That's true, but the number of cache users is so high that they distort the measured figures too much for these to serve as a basis for an analysis. Additionally, in a cache union the participating caches exchange the stored objects among themselves, so that in this case there are even fewer direct accesses to the WWW server.
The only exact information that you can get from a log file and use without limitations is the amount of data transmitted. These figures show how heavily your Web offering has loaded the WWW server's Internet connection. However, system administration tools have more to offer here than statistics programs: standard tools such as net-acct break down the amount of data transferred and differentiate between sent and received data.
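For the byte counts themselves, a log file is at least usable. The following sketch sums the response sizes recorded in a Common Log Format file (the last field of each entry); note that, unlike net-acct, it only sees data the WWW daemon itself sent, and the file name is illustrative:

def bytes_transferred(path):
    """Sum the response sizes (last Common Log Format field) of a log."""
    total = 0
    with open(path) as log:
        for line in log:
            fields = line.split()
            if fields and fields[-1].isdigit():   # "-" means no body sent
                total += int(fields[-1])
    return total

print(bytes_transferred("access.log") / 1024, "Kbytes sent by the daemon")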
Who sent the request is unclear
All the remaining log file entries can only be interpreted as a minimum of the WWW objects actually transmitted, on account of all the caches in use. Nor is it taken into consideration whether a request to a server comes from an actual user, a robot or a meta search engine. Because these services want to stay up to date, they have to constantly call up and index the whole Web; across all the search engines this represents a not insignificant number of hits.
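The robot share can at least be estimated from the log file itself. One common heuristic - an assumption for illustration, not a method from the article - is that well-behaved robots fetch /robots.txt first; the sketch below flags hosts that did so and reports their share of the hits:

def robot_share(path):
    """Estimate how many hits come from hosts that fetched /robots.txt."""
    robot_hosts = set()
    with open(path) as log:                      # pass 1: find robot hosts
        for line in log:
            if '"GET /robots.txt' in line:
                robot_hosts.add(line.split()[0])
    hits = robot_hits = 0
    with open(path) as log:                      # pass 2: count their hits
        for line in log:
            fields = line.split()
            if fields:
                hits += 1
                if fields[0] in robot_hosts:
                    robot_hits += 1
    return hits, robot_hits

total, from_robots = robot_share("access.log")
print(from_robots, "of", total, "hits are from probable robots")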
There are further methods for determining the actual user statistics. As a first and obvious solution, some webmasters put a counter (see [[#literature 2]]) on every page: either a new GIF image is created for every access, or an automatically generated text is integrated into the page by the WWW daemon via server side includes. However, this method no longer catches every single access, because some of the newer caches, such as Harvest Squid, mercilessly serve even pages with server side includes from the cache - regardless of the fact that the counter value is then wrong (there's an example on the Counter Page from Webtools.org). A further inaccuracy is caused by users who work with text-based browsers such as Lynx, or who have image loading turned off, when they access pages that are to be counted.
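A minimal sketch of such a counter as a CGI program: each call increments a status file and returns the new value as plain text, which could be embedded via a server side include. The path and the locking details are illustrative, not taken from the article:

#!/usr/bin/env python3
# CGI counter sketch: increment a status file, return the value as text.
import fcntl

COUNTER_FILE = "/var/tmp/counter.status"        # illustrative path

with open(COUNTER_FILE, "a+") as f:
    fcntl.flock(f, fcntl.LOCK_EX)               # avoid lost updates
    f.seek(0)
    value = int(f.read() or "0") + 1
    f.seek(0)
    f.truncate()
    f.write(str(value))

print("Content-Type: text/plain\n")             # header plus blank line
print(value)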
Cookies aren't an alternative to counters
As an alternative to counters, you can use a cookie on every page [[#literature 3]]. In this case, the server passes cookie information to the client every time a URL is called up, and the client saves it in the associated file, which the browser puts in the appropriate sub-directory. But because not all browsers support cookies, users of browsers that do would be over-represented, since an analysis would only count their cookie information; all the other users wouldn't be included.
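A sketch of the cookie approach as a CGI program, assuming the standard CGI environment; the cookie name visitor_id and the identifier scheme are invented for illustration:

#!/usr/bin/env python3
# Cookie sketch: assign an identifier to clients that don't have one yet.
import os, uuid

cookies = os.environ.get("HTTP_COOKIE", "")     # cookies sent by the client
if "visitor_id=" not in cookies:
    # New client - or one whose browser doesn't support cookies at all,
    # in which case this header has no effect and the user stays uncounted.
    print("Set-Cookie: visitor_id=%s; path=/" % uuid.uuid4().hex)
print("Content-Type: text/html\n")
print("<html><body>counted page</body></html>")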
Assuming that most caches are configured so that they don't store files whose URL contains a cgi-bin directory, the Web documents to be measured could be generated by a script which increments an internal counter each time the page is called up. Though such a configuration is common practice, it's not mandatory, and for some user groups it isn't even sensible. In addition, this method runs totally counter to the idea of caching and can lead to annoying delays.
A fundamental problem arises from a newly developed feature in Linux: IP masquerading hides almost any number of computers behind a single IP address, making it impossible to assign the logged IP addresses to the actual machines.
A further problem is caused by Internet cafés and areas with computers available to a wide range of people. In a working environment like that you can't tell whether user A has taken a break and then carried on using the computer, or if person B is now on the machine. However, this can be smoothed out with suitable statistical methods.
Standardising measurement criteria
To at least be able to attempt comprehensible claims about the effectiveness of online advertising, commercial companies offer developments and finished programs. The following looks in some detail at the two most interesting: the Rawena method, versions 3.0 and 3.1, from Ecce Terram, and AC Nielsen's Webtracking method. Both build on terms that have been in use for a long time, such as hit, visit and page view. The following definitions are from Rawena 3.0:
- Visits count the number of Internet hosts which have retrieved at least one WWW page from the server, taken from the server log files. Further transmissions to the same address are ignored if they take place within a predefined period of time; Ecce Terram has fixed how long this period is (in the new version this parameter is no longer required). A sketch implementing this definition follows the list.
- Page views state how often a page containing online advertising has been delivered. Whether it went to the same or to different users is insignificant.
- Hits count every access to a Web object, taken from the server log file. Because this includes graphics, photos and logos - and, in the case of frames, several pages - a single page is sometimes counted several times, even if it's only called up once.
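The visit definition can be made precise with a small sketch. Since Ecce Terram's actual timeout isn't published, a 30-minute window is assumed here purely for illustration:

TIMEOUT = 30 * 60          # assumed visit timeout in seconds (not published)

def count_visits(entries):
    """entries: iterable of (host, unix_timestamp) pairs from the log."""
    last_seen = {}
    visits = 0
    for host, stamp in sorted(entries, key=lambda e: e[1]):
        if host not in last_seen or stamp - last_seen[host] > TIMEOUT:
            visits += 1                         # new visit for this host
        last_seen[host] = stamp
    return visits

# host-a pauses for more than 30 minutes, so it counts as two visits.
entries = [("host-a", 0), ("host-a", 600), ("host-a", 4000), ("host-b", 0)]
print(count_visits(entries))                    # -> 3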
The Rawena 3.0 method was used until 1 September 1996 and was then replaced by version 3.1. The first version is said to work with log files from the WWW daemon and from several caches. Ecce Terram published extensive documentation about the terms and ideas used, though not the exact method - for legal reasons, the company says. That's why only paying licence holders could check the method for errors and loopholes. In this version - in the author's opinion - there are also problems with cascading caches.
Method of measuring increases strain on server
Version 3.1 is used now. It works with a so-called Z Box, through which every access to a Web document that is to be measured passes, and which counts these accesses. Of course, that only works on the assumption that - as already mentioned earlier - /cgi-bin/ programs aren't stored by caches. The Z Box is installed on port 80 - the standard port for a WWW server - and communicates with the actual Web server over a second port. This possibly requires changes to CGI programs, and because all the data has to be exchanged over the ports, the method increases the load on the server. For heavily used servers this solution isn't exactly optimal.
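The description suggests an architecture like the following speculative sketch - emphatically not Ecce Terram's implementation: a counting process on port 80 that relays each connection to the real server on a second port (8080 is assumed). It counts connections, which under HTTP/1.0 roughly corresponds to requests:

# Speculative counting relay - NOT Ecce Terram's implementation.
import socket
import threading

BACKEND = ("127.0.0.1", 8080)   # assumed second port of the real server
hits = 0
lock = threading.Lock()

def relay(src, dst):
    """Copy bytes from one socket to the other until the sender closes."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)

def handle(client):
    global hits
    with lock:
        hits += 1               # one connection ~ one HTTP/1.0 request
    backend = socket.create_connection(BACKEND)
    threading.Thread(target=relay, args=(backend, client)).start()
    relay(client, backend)

listener = socket.socket()
listener.bind(("", 80))          # the standard WWW port; needs root
listener.listen()
while True:
    conn, _ = listener.accept()
    threading.Thread(target=handle, args=(conn,)).start()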
Even the assumption that cache servers don't store URLs routed over the Z Box needs to be re-verified whenever new caches appear (see above). Cascaded caches pose no problem for this method, because it analyses the header lines of the request, in which each server passed through in an integrated cache system cites itself. The exact method of measuring was previously given only to paying licence holders, but has recently been made available to developers and journalists free of charge, provided they sign a non-disclosure agreement promising not to let this knowledge flow into commercial developments and not to use exact quotes. This openness allows an analysis and discussion of the methods used - an essential requirement for methods used on the Internet.
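Analysing such header lines might look like the following sketch. It assumes the caches insert Via headers of the kind standardised with HTTP/1.1; older caches may use different headers or none at all, so this illustrates the principle rather than Rawena's actual method:

def cache_chain(raw_request):
    """Return the cache stations named in the Via headers of a request."""
    headers = raw_request.split("\r\n\r\n", 1)[0]
    stations = []
    for line in headers.split("\r\n")[1:]:      # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "via":
            stations += [hop.strip() for hop in value.split(",")]
    return stations

req = ("GET /index.html HTTP/1.0\r\n"
       "Via: 1.0 cache1.example.org, 1.0 cache2.example.org\r\n\r\n")
print(cache_chain(req))   # -> two stations, in the order they were passed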
Actual accesses not determined
Even though the Rawena method provides more exact results than a pure analysis of the log files with the tools named earlier, or with others, this implementation can't determine the actual number of accesses. Ecce Terram recognises this problem and therefore provides an additional correction factor. The VDZ (Verband Deutscher Zeitschriftenverleger [German magazine publishers' association]) has licensed this method; whether it's really used, though, isn't clear. In August the four media associations (newspaper and magazine publishers, private radio and television stations, and the multimedia association) agreed on page views and visits as the "media currency" and are currently recommending Rawena 3.1 for popularity analysis. However, the VDZ, in agreement with the other media associations, is investigating the possibility of developing its own measuring method, to be carried out by a neutral institution. Discussions are currently underway with the IVW which, according to the VDZ, will lead to a result before the end of the year.
Webtracking from AC Nielsen isn't as well documented as Rawena. There's just one page on the Web server, and it explains the principal terms; the method used is hardly mentioned. This method is evidently also based on a kind of Z Box, which is installed locally by Web providers. More detailed questions by telephone were answered only with "company secret". Without knowledge of the method used, though, it's impossible to assess its accuracy or to form a judgement of it.
Manipulating the figures is easy
The Internet community has long known how user statistics can be inflated considerably, in spite of the measuring methods introduced. Editing a status file can quickly give a simple counter a higher value, and lots of graphics, logos or buttons increase the number of hits. Extensive use of frames also increases the number of page accesses (Autobild offers an extreme example with 11 different frames, the whole arrangement providing no extra information). An online report by the German magazine Fokus ran under a headline to the effect of "WWW bingo: user statistics as desired", describing how Bild Online (the online version of a German tabloid newspaper) could praise itself with claims of "700 000 page views per week". Because such statistics really do have an effect, Spiegel Online changed its layout to frames, in spite of strong protests from users. However, such a method of counting contradicts the Rawena 3.0 method which, according to Fokus Online, is also known to those responsible at Spiegel Online.
Since none of the methods described can show the actual number of accesses to a Web page, rather than just a minimum, the question remains whether an analysis of trends is at least possible. With traditional log file analysis it hardly is, if you consider that the attractive, often-requested WWW objects, which sit permanently in the caches, can only be recorded indirectly via a correction factor. Experience will show how far the methods used by Ecce Terram and AC Nielsen are suitable for analysing trends.
The results of a simple log file analysis by means of traditional programs or built-in counters are in any case so inexact that they can't even serve as a basis for estimates. Some questions remain open with the Rawena method, but they can now be discussed thanks to the publication of the methods used. With AC Nielsen, on the other hand, thanks to the practically non-existent documentation there are hardly any open questions…
Log files should be supplemented
A sensible starting point for the exact recording of accesses would be the creation of a common protocol in which the log file entries of the WWW daemon and of the caches supplement each other. If a company wants to know the exact number of accesses to its WWW server, it could work them out from the combination of all the log files; that would even be possible with caches interlinked with each other. The obvious disadvantage of this method is that logging would have to be activated on all the caches taking part in an integrated system, which would quickly produce several megabytes per day and server, depending on the popularity and size of a cache server. But perhaps such analyses would be an interesting niche for a service company to fill.
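A sketch of how such a combined analysis could begin: merge the daemon's log with the caches' logs and total the accesses per URL. The file names are invented, and a real implementation would also have to subtract requests that a cache forwarded to the origin server, so they aren't counted twice:

from collections import Counter

LOGS = ["daemon-access.log", "cache1-access.log", "cache2-access.log"]

def combined_hits(paths):
    """Total the accesses per URL over several Common Log Format files."""
    hits = Counter()
    for path in paths:
        with open(path) as log:
            for line in log:
                parts = line.split('"')
                if len(parts) >= 2:
                    request = parts[1].split()  # e.g. GET /page HTTP/1.0
                    if len(request) >= 2:
                        hits[request[1]] += 1
    return hits

for url, count in combined_hits(LOGS).most_common(10):
    print(count, url)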
Agreement on HTTP 1.1, which contains header entries for usage measurements, will also allow more precise measuring. But it will be some time until that happens and this standard might possibly be overtaken - as with HTML 3.2 - by the big browser developers.