Sent from: qix@desire.apana.org.au (Mitchell Porter)

This has evidently been clipped from the RISKS mailing list.
-mitch

---------------

> Date: Sun, 4 Feb 96 20:50:03 CST
> From: jdellinger@amoco.com (Joe A. Dellinger)
> Subject: Risks of web robots

Here are three risks of "web robots" I've run across recently that I think
Risks readers might find interesting.

1) The first is probably already well known to Risks readers: password files
accidentally being exported to the world. Web servers are just yet another
way of making that mistake. Here is a post that has already had wide
circulation (and may have already appeared in Risks... I'm unable to scan
back issues to check right now because of heavy network load):

>Subject: BoS: Misconfigured Web Servers
>
> A friend of mine showed me a nasty little "trick" over the weekend. He
> went to a Web Search server (http://www.altavista.digital.com/) and
> did a search on the following keywords -
>
> root: 0:0 sync: bin: daemon:
>
> You get the idea. He copied out several encrypted root passwords from
> passwd files, launched CrackerJack and a 1/2 MB word file and had a
> root password in under 30 minutes. All without accessing the site's
> server, just the index on a web search server!
> ....
>
> The guy that showed me this found it funny, but I find it disturbing.
> Are there that many sites that are that poorly configured?
>
> Mark_W_Loveless@smtp.bnr.com

I just verified that indeed this search does work, although to my relief
the majority of the "hits" found are legitimate documents discussing UNIX
security. The risks are fairly obvious.

1') Here is a variation on the above risk that I HAVEN'T seen discussed
before, however. See what happens if you search AltaVista for THESE
keywords:

"unpublished proprietary source code actual intended reserved copyright notice"

The results of this search are even more frightening, at least to me.
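
For the password-file case in item 1, a site operator could check their own exported document tree before a robot indexes it. What follows is only a minimal sketch and is not part of the original post: the function names, the regular expression, and the `min_hits` threshold are illustrative assumptions, and the pattern recognises only the classic seven-field passwd layout (name:password:uid:gid:gecos:home:shell).

```python
import os
import re

# Assumed pattern: the classic seven-field passwd layout,
# name:password:uid:gid:gecos:home:shell, with numeric uid and gid.
PASSWD_LINE = re.compile(r"^[A-Za-z_][\w.-]*:[^:]*:\d+:\d+:[^:]*:[^:]*:[^:]*$")


def passwd_like_lines(text):
    """Return the lines of `text` that have the seven-field passwd layout."""
    return [line for line in text.splitlines() if PASSWD_LINE.match(line.strip())]


def scan_tree(doc_root, min_hits=3):
    """Yield (path, hit_count) for files under doc_root that contain
    at least `min_hits` passwd-like lines (hypothetical threshold)."""
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    hits = len(passwd_like_lines(handle.read()))
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            if hits >= min_hits:
                yield path, hits
```

Running `scan_tree` over the directory a web server exports would list candidate files to review by hand. Documents that merely discuss UNIX security may also match, just as they do in the search described above, so a hit is a prompt to look, not proof of a leak.
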
The general risk is not just that you can conveniently find password files,
but ANY kind of document that shouldn't be widely distributed: material
useful for breaking into your system, copyrighted material, illegal
material, libelous material, incriminating or embarrassing material, etc...

2) The second risk works the other way: fooling stupid web robots so as to
lure people to your web site. A month ago I tried searching for "eisner
reciprocity paradox" on WebCrawler, hoping to find that it had indexed a
paper of mine that I had reprinted electronically under my home page. Nope,
it hadn't (or at least I was unable to find it using any of the likely
keywords I could think of!). Instead the single match was on a URL
intriguingly entitled "The information source". Gee, this "information
source" must have an article in it about Eisner's Reciprocity Paradox, one
that I hadn't known of before! So I followed the link, and ended up at
something unexpected: "http://www.graviton.com/red/", "The Red Herring Home
Page"! (It comes complete with gifs of red fish!)

A little experimentation revealed that almost ANY obscure search would match
"The information source", often as the only matching document found. As near
as I could figure out, his site recognized probes by web robots and then
threw a dictionary at them! (His point made, he has since stopped, although
the Red Herring page is still there for your perusal.)

I contacted the author, Tom White, and asked for more details. He didn't
want to give his secrets away, but did reply:

> I will say that I spent no more than an hour on the whole thing, including
> writing the page, and it was effective far beyond what I thought a silly
> trick like that would muster. I think that by virtue of not hiding what
> I am trying to do, people who write web indexers may see the page and think
> of ways to subvert feeble attempts like mine - which is a good thing since
> the page could have as easily been any propaganda I wanted to push on people.

The risk? It can be frustratingly difficult (or impossible) to get a web
robot's attention for a legitimate page you WANT indexed, or to find a page
you know is there amid all the distractions of "false hits". Part of the
clutter may be wildly off-topic pages engineered to fool web robots into
thinking that almost anything matches them. (Or simply long rambling pages
containing lots of poems and such... documents that "fool" the robots more
by accident than design.)

3) Finally, the act of being searched can cause problems for certain kinds
of sites: ones that carry hundreds of thousands of distinct URLs, often
generated only on demand, and that don't expect any one site to ever have
reason to download ALL of them, whether all at once or a few at a time. See
for example "http://xxx.lanl.gov/RobotsBeware.html". The authors state there:

"This www server has been under all-too-frequent attack from `intelligent
agents' (a.k.a. `robots') that mindlessly download every link encountered,
ultimately trying to access the entire database through the listings links.
In most cases, these processes are run by well-intentioned but thoughtless
neophytes, ignorant of common sense guidelines."

They have been forced to take a "proactive" stance to protect themselves:

"We are not willing to play sitting duck to this nonsensical method of
`indexing' information."

The rather UNIQUE hot link that follows, "(Click here to initiate automated
`seek-and-destroy' against your site.)", doesn't actually do anything but
pause for 30 seconds, I'm told... I'll let readers examine the page and draw
their own Risks!

------------------------------------

Slev

--
"What is life? It is the flash of a firefly in the night.
It is the breath of a buffalo in the winter time. It is the little shadow
which runs across the grass and loses itself in the Sunset."
Crowfoot, Blackfoot Elder
Bow River, Canada

--
-mitch
http://desire.apana.org.au/~qix