🦋 Filtering out page reads
You get a lot of stuff in your web server log file that does not have to do with actual human reads of your site. I wrote a script that I think shows all the human page views in an Apache log file. It relies on the fact that browsers fetch CSS stylesheets, while robots generally don't. (It will miss humans using Lynx; it could easily be tweaked to handle that well enough. Also, I have seen Yahoo fetching CSS files; you can fix that by putting "Slurp" in the list of patterns you're not interested in.)
grep "blog.css" $logfile | // get all reads
of blog.css
awk '{print $1;}' | // extract ip address
sort | uniq | // only show each ip once
grep -f - $logfile | // now pass that list
of ip's back to grep
grep " 200 " | // only show successful reads
egrep -v (any files you're not interested in)
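Putting it all together, here is the pipeline as a standalone script. This is just a sketch: the log path and the egrep exclusion pattern are made-up examples, so substitute your own log location and whatever files and robots you want to drop.

#!/bin/sh
# Sketch of the pipeline above as a runnable script.
# The log path and the egrep pattern are hypothetical; edit them for your site.
logfile=/var/log/apache2/access.log

grep "blog.css" "$logfile" |   # get all reads of blog.css
awk '{print $1;}' |            # extract the IP address
sort | uniq |                  # only show each IP once
grep -f - "$logfile" |         # pass the list of IPs back to grep
grep " 200 " |                 # only show successful reads
egrep -v "blog\.css|favicon\.ico|Slurp"   # drop the css reads themselves, favicon hits, and Yahoo's robot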
I believe you could also use "favicon.ico" instead of your css file, but this is less reliable: I don't know how often browsers request favicon for sites they have already visited. Or you could use the filename of a graphic included on one of your pages and hosted on your site; I think that would work reasonably well.
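Either way, only the first stage of the pipeline changes; for instance (header.jpg is a made-up filename standing in for whatever graphic your pages actually load):

grep "favicon.ico" $logfile |   # or: grep "header.jpg" $logfile
awk '{print $1;}' |
sort | uniq |
grep -f - $logfile |
grep " 200 " |
egrep -v "(any files you're not interested in)"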
posted evening of Tuesday, November 20th, 2007