I think this is an issue that should see discussion on the CSuite-Dev
mailing list, so I have copied it there.

There exists a voluntary strategy designed by a group of people who design
and operate various spiders, search robots, etc. The information is located
on the Webcrawler server (URL below), purely because one of the leaders of
the project works there. The strategy is intended to control the indexing
of sites, and to reduce the load placed by robots on the servers they visit.

The strategy depends upon the presence of a file at the root of any web
site, i.e. accessible by the URL

    http://the.site/robots.txt

of type "text/plain".

The idea is that a spider should first read the robots.txt file and, if it
pays any attention to the protocol, ignore any URLs _beginning_ with the
paths specified, as in:

    # comment
    # each block begins with one or more user-agent specifiers
    # User-agent is the value sent in the HTTP User-agent header
    User-agent: robot-name
    # User-agent * applies to all unspecified agents
    # NO record implies no limits
    # Allow and Disallow lines
    Allow: /Help
    Disallow: /tmp
    ...

The path test succeeds if no difference is found by the end of the record's
path; the path is to be treated as _not case sensitive_, and the first
match exits the test. The action (allow or disallow) is taken if the test
succeeds; if no test succeeds on the URL, the default is allow.

Support is provided so that different robots can be allowed different
access. This is where we should specify that our lqtext indexer should
index particular paths. Again, the first robot that matches takes its
commands from that section.

For CSuite, the file needs to be placed in the info root of each VCN, and
be tailored to that VCN's URLs.

The key concept is that the protocol is voluntary, and somewhere someone
will write a robot that deliberately violates it.
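To make the path test and the per-robot sections concrete, here is a
minimal sketch in Python (my own illustration, not part of the Webcrawler
document or of CSuite). It assumes the file has already been parsed into
(User-agent pattern, rule list) records kept in file order, and the agent
name "lq-text" used for our indexer is made up:

# Minimal sketch of the matching rules described above.  "records" is
# assumed to be already parsed from robots.txt into (pattern, rules) pairs,
# kept in file order; the "lq-text" agent name is hypothetical.

def select_record(records, agent):
    """Return the rules of the first record whose User-agent matches."""
    for pattern, rules in records:
        if pattern == "*" or pattern.lower() in agent.lower():
            return rules              # first matching robot takes this section
    return None                       # no record at all implies no limits

def is_allowed(records, agent, url_path):
    rules = select_record(records, agent)
    if rules is None:
        return True
    for action, path in rules:
        # Path test: case-insensitive prefix comparison, first match exits.
        if url_path.lower().startswith(path.lower()):
            return action == "allow"
    return True                       # no rule matched the URL: default allow

if __name__ == "__main__":
    records = [
        ("lq-text", [("allow", "/Help"), ("disallow", "/")]),
        ("*",       [("disallow", "/tmp"), ("disallow", "/adm")]),
    ]
    print(is_allowed(records, "lq-text/1.0", "/Help/start"))     # True
    print(is_allowed(records, "lq-text/1.0", "/Policy/x"))       # False
    print(is_allowed(records, "WebCrawler/2.0", "/adm/users"))   # False
    print(is_allowed(records, "WebCrawler/2.0", "/Help/start"))  # True

Note this follows the "first robot section that matches wins" rule exactly
as stated above, so the ordering of the records in the file matters.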
On Tue, 25 Feb 1997, David L. Potter wrote:

> I've cobbled up a distribution version of robots.txt based on info
> located at:
>
> http://info.webcrawler.com/mak/projects/robots/norobots.html
>
> contents of (attached) file follows....
>
> #================ begin =========================
>
> #
> # Robots.txt for CSuite distribution Beta 1.0 Version
> #

Default expiry is 7 days; until then a robot can cache the file. Is that
reasonable? Do we have the hooks to change it? (Put the content inside a
script, and make "robots.txt" a soft-link to that script? A minimal sketch
of that approach follows at the end of this message.)

There needs to be one or more User-agent lines, or this file will be
ignored.

> Disallow:/Local		# local documents

> Disallow:/csuite	# webcrawlers should point to Chebucto for current version of these

Implication: the above line should not be present on CCN.

> Disallow:/Policy	# these documents picked up under /Help

That does not work. The idea is that the robot will look at a page and
follow all the links. But if a Help page points to a page in /Policy, this
would prevent retrieval of that page.

> Disallow:/Services	# these documents picked up under /Help

As above.

> Disallow:/adm		# restricted processes

Good, but see my note about robots that deliberately flout the
recommendations in an effort to break into a site. The authentication
request should stop the robot from indexing the page.

> Disallow:/donors	# local

I wonder about that!

> Disallow:/recent	# changes

Maybe, but think about a robot that might specifically search for recent
stuff. We maybe ought to have a separate section for that kind of robot.

> Disallow:/Copyright	# webcrawlers should point to Chebucto for this

What is there?

> Disallow:/Home.html~	# backup

Should backup files (.html~) be in the allowed document types for the
server at all?

> Disallow:/Home.tmpl	# template

Should this be servable?

> Disallow:/Memopts.html	# local document

On CCN, this is in /Local.

> Disallow:/ips.html	# rebuilt regularly

True, but it's a good item to have on a search engine! All the change does
is make a small part of the content stale.

> Disallow:/jumps.html	# changes occasionally

Maybe so, but what's wrong with it being indexed? Perhaps put it in a
section restricted to specific robots.

> Disallow:/motd		# changes

Agreed; not because it changes, but because the content is not helpful to a
search user.

>
> #================= end ==========================
>
> ---------------------------------------------------------------------
> David Potter              http://chebucto.ns.ca/CSuite/CSuite.html
> Documentation Team        Chebucto Community Net
> ============== CSuite - Community Network Software ==================

Ed Dyer   aa146@chebucto.ns.ca                          (902) H 826-7496
CCN Assistant Postmaster   http://www.chebucto.ns.ca/~aa146/   W 426-4894
CSuite Technical Workshop
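As mentioned above, here is a rough sketch (my own, not CSuite code) of the
"robots.txt behind a script" idea: serve /robots.txt from a CGI script, or
make it a soft-link to one, so the server can attach an explicit Expires
header rather than leaving robots to their default 7-day cache. The file
location and the one-day lifetime are assumptions.

#!/usr/bin/env python3
# Sketch: serve robots.txt through CGI with an explicit Expires header.
# ROBOTS_FILE and the one-day LIFETIME are hypothetical, not CSuite settings.

import time
import email.utils

ROBOTS_FILE = "/csuite/etc/robots.txt"   # assumed location of the real rules
LIFETIME = 24 * 60 * 60                  # let robots cache for one day

with open(ROBOTS_FILE) as f:
    body = f.read()

# CGI output: headers, a blank line, then the document body.
print("Content-Type: text/plain")
print("Expires: " + email.utils.formatdate(time.time() + LIFETIME, usegmt=True))
print()
print(body, end="")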