robots.txt rus

Date: Tue, 25 Feb 1997 14:13:59 -0400
From: Edward Dyer <aa146@chebucto.ns.ca>
To: csuite-tech@chebucto.ns.ca
cc: CSuite Development List <csuite-dev@chebucto.ns.ca>

I think this is an issue that should see discussion on the CSuite-Dev mailing 
list, so I have copied it there.

There exists a voluntary strategy designed by a group of people who 
design and operate various spiders, search robots, etc.  The information is 
located on the Webcrawler server (URL below), purely because one of the 
leaders of the project works there.  The strategy is intended to control 
the indexing of sites, and to reduce the load placed by robots on the 
servers they visit.

The strategy depends upon the presence of a file at the root of any web
site, i.e. accessible by the URL http://the.site/robots.txt, of type
"text/plain".  The idea is that a spider should first read the robots.txt
file and should, if it pays any attention to the protocol, ignore any
URLs _beginning_ with the paths specified as:

# comment
# each block begins with one or more user-agent specifiers
# User-agent is the value sent in the HTTP User-Agent header
User-agent: robot-name
# User-agent: *   applies to all unspecified agents
# NO record implies no limits

# Allow and Disallow lines
Allow:      /Help
Disallow:   /tmp
...

The path test succeeds if no difference is found by the end of the record;
the path is to be treated as _not case sensitive_, and the first match exits
the test.  The allow or disallow action is taken if the test succeeds;
otherwise, if no test succeeds on the URL, the default is allow.
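The matching rule described above can be sketched as follows (a hypothetical illustration in Python; the function and rule names are mine, not part of the protocol, and matching is treated as case-insensitive per the description above):

```python
# Sketch of the path test: compare a URL path against the Allow/Disallow
# rules in order.  The first prefix match decides the outcome; if no
# rule matches, the default is "allow".

RULES = [
    ("allow",    "/Help"),
    ("disallow", "/tmp"),
]

def robot_may_fetch(path, rules=RULES):
    p = path.lower()
    for action, prefix in rules:
        # A rule matches when its path is a prefix of the URL path.
        if p.startswith(prefix.lower()):
            return action == "allow"
    return True  # no rule matched: default allow

print(robot_may_fetch("/Help/intro.html"))  # matches Allow: /Help
print(robot_may_fetch("/tmp/scratch"))      # matches Disallow: /tmp
print(robot_may_fetch("/Services"))         # no match: default allow
```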

Support is provided so that different robots can be allowed different access.
This is where we should specify that our lqtext indexer should index 
particular paths.  Again, the first robot record that matches is the one a 
robot takes its commands from.
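As a sketch of what such a per-robot file might look like (the paths and the use of "lqtext" as a User-agent value are illustrative assumptions, not a tested configuration):

```
# Hypothetical example: give our own indexer wider access than
# unknown robots.  The first record whose User-agent matches wins.

User-agent: lqtext        # our local indexer (name assumed)
Disallow:   /adm

User-agent: *             # all other robots
Disallow:   /adm
Disallow:   /tmp
Disallow:   /Local
```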

For CSuite, the file needs to be placed in the info root of each 
VCN, and be tailored to that VCN's URLs.

The key concept is that the protocol is voluntary, and somewhere someone 
will write a robot that deliberately violates it.

On Tue, 25 Feb 1997, David L. Potter wrote:

> I've cobbled up a distribution version of robots.txt based on info 
> located at:
> 
> http://info.webcrawler.com/mak/projects/robots/norobots.html
> 
> contents of (attached) file follows....
> 
> #================ begin =========================
> 
> #
> # Robots.txt for CSuite distribution Beta 1.0 Version 
> #

Default expiry is 7 days; until then a robot can cache the file.  Is that 
reasonable?  Do we have the hooks to change it?  (Put the content inside a 
script, and make "robots.txt" a soft link to that script?)
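One way the soft-link-to-a-script idea could work, assuming the server is willing to run the link target as a CGI program (a sketch only; the 1-day lifetime is an arbitrary example, not a recommendation):

```python
#!/usr/bin/env python
# Hypothetical CGI sketch: serve robots.txt from a script so that we
# control the Expires header, instead of accepting a robot's default
# 7-day cache of a static file.
import time

LIFETIME = 24 * 60 * 60  # seconds a robot may cache the file (1 day)

# HTTP-date format for the Expires header, always in GMT.
expires = time.strftime("%a, %d %b %Y %H:%M:%S GMT",
                        time.gmtime(time.time() + LIFETIME))

print("Content-Type: text/plain")
print("Expires: %s" % expires)
print()
print("User-agent: *")
print("Disallow: /adm")
```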

There needs to be one or more user-agent lines, or this will be ignored.

> Disallow:/Local		# local documents
> Disallow:/csuite	# webcrawlers should point to Chebucto for current version of these 

Implication: the above line should not be present on CCN

> Disallow:/Policy	# these document picked up under /Help

That does not work.  The idea is that the robot will look at a page and 
follow all the links.  But if a Help page points to a page in /Policy, 
this would prevent retrieval of that page.

> Disallow:/Services	# these document picked up under /Help

As above.

> Disallow:/adm		# restricted processes

Good, but see my note about robots that deliberately flout the 
recommendations in an effort to break into a site.  The authentication 
request should stop the robot from indexing the page.

> Disallow:/donors	# local

I wonder about that!

> Disallow:/recent	# changes

Maybe, but think about a robot that might specifically search for recent 
material.  Perhaps we ought to have a separate section for that kind of robot.

> Disallow:/Copyright   	# webcrawlers should point to Chebucto for this

What is there?

> Disallow:/Home.html~ 	# backup

Should backup files (.html~) be in the allowed document types for the server?

> Disallow:/Home.tmpl 	# template

Should this be servable?

> Disallow:/Memopts.html	# local document

On CCN, this is in /Local

> Disallow:/ips.html   	# rebuilt regularly

True, but it's a good item to have on a search engine!
All the change does is make a small part of the content stale.

> Disallow:/jumps.html	# changes ocassionally

Maybe so, but what's wrong with it being indexed?
Perhaps put it in a section restricted to specific robots.

> Disallow:/motd		# changes

Agreed, not because it changes, but because the content is not helpful 
to a search user.

> 
> #================= end ==========================
> 
> ---------------------------------------------------------------------
> David Potter                 http://chebucto.ns.ca/CSuite/CSuite.html
> Documentation Team                             Chebucto Community Net
> ============== CSuite - Community Network Software ==================
> 

Ed Dyer aa146@chebucto.ns.ca   (902) H 826-7496  CCN  Assistant Postmaster
http://www.chebucto.ns.ca/~aa146/    W 426-4894  CSuite Technical Workshop
