This how-to covers various techniques for preventing pages from
being included in your search engine index.
It also covers how to prevent the spider (indexer) from following
one or more links on a page,
and how to keep pages out of your site map and what's new page.
You can tell the spider (indexer) to ignore parts of your site in a few different ways.
In order of preference, you can either use the Control Center or
use a "robots.txt" file.
If you want to prevent only part of a page from being indexed you can
use special search engine HTML tags.
You can also make the spider ignore specific links or types of links.
Since the spider finds the pages of your site by following links, preventing
it from following specific links can be used to prevent parts of your
site from being indexed. This technique can also be used to change the
structure of your site map by changing how the spider thinks your
site is linked together.
The sections below cover your various options.
Excluding Pages Using the Control Center
This is the preferred way of preventing pages from being included in your index.
To do this, simply log in to your account, go to the Control Center,
and use the exclude pages link.
When the wizard appears, add your list of "exclusions", one per line
(any line wrapping added by your browser is ignored),
and press the button to save your changes.
Each exclusion consists of a "URL mask" optionally followed by one or more exclusion modifiers.
The URL mask is simply a standard web address, but may contain the common wildcards
"*" and "?" to make it
match more than one web address.
The "*" will match any number of any character and
the "?" will match any single character.
Non-wildcard characters are matched without regard to case (case-insensitive).
URL masks which do not begin with "http://"
are treated as if they begin with "*".
Because of this, it is recommended that you include the "http://"
in your URL masks.
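To make these rules concrete, here is a minimal Python sketch showing how such a URL mask could be turned into a pattern matcher. It only illustrates the matching rules described above; it is not the spider's actual implementation.

import re

def mask_to_regex(mask):
    # Masks that do not begin with "http://" behave as if they begin with "*".
    if not mask.startswith("http://"):
        mask = "*" + mask
    # Escape everything, then translate the two wildcards:
    # "*" matches any run of characters, "?" matches a single character.
    pattern = re.escape(mask).replace(r"\*", ".*").replace(r"\?", ".")
    # Non-wildcard characters match case-insensitively.
    return re.compile("^" + pattern + "$", re.IGNORECASE)

# "*.txt" matches any URL ending in ".txt", on any site:
print(bool(mask_to_regex("*.txt").match("http://example.com/notes/README.TXT")))  # True
# "?" matches exactly one character:
mask = mask_to_regex("http://example.com/alphaindex/?.html")
print(bool(mask.match("http://example.com/alphaindex/a.html")))         # True
print(bool(mask.match("http://example.com/alphaindex/aardvark.html")))  # False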
The URL mask may be followed by exclusion modifiers. There are two:
index=no/yes
follow=no/yes
The "index" modifier specifies
whether pages matching the mask will be included in the index.
The "follow" modifier specifies
whether pages matching the mask will have their links followed in order
to locate other pages to index.
The default values are:
index=no follow=no
When determining which exclusion to apply, the entire list of exclusions is considered
and the last matching exclusion is used. This allows convenient expression
of "exclude everything but..." logic. For example, to prevent everything in your
"http://example.com/cgi-bin/" directory from
being indexed except pages generated by the CGI
"content.cgi",
you can use the following two exclusions:
http://example.com/cgi-bin/*
http://example.com/cgi-bin/content.cgi* index=yes follow=yes
The exclusion:
http://example.com/somepage.html
prevents that file from being included in the index.
The exclusion:
http://example.com/archive/*
prevents everything in the "archive" directory from being included in the index.
The exclusion:
/archive/*
prevents everything in any "archive" directory from being included in the index
regardless of the site it's on.
The exclusion:
http://example.com/*.txt
prevents files on "example.com" that end with the extension ".txt" from being included in the index.
The exclusion:
*.txt
prevents all files that end with the extension ".txt" from being included in the index
regardless of what site they're on.
The exclusion:
http://example.com/alphaindex/?.html
prevents a file like "http://example.com/alphaindex/a.html"
from being indexed, but would allow a file
"http://example.com/alphaindex/aardvark.html"
to be indexed.
The exclusion:
http://example.com/alphaindex/?.html follow=yes
prevents a file like "http://example.com/alphaindex/a.html"
from being added to the index but allows the spider to find and follow the links in that page.
The exclusion:
http://example.com/alphaindex/?.html index=yes
allows that file to be added to the index but prevents the spider from following any of the links in that file.
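The following Python sketch illustrates the "last matching exclusion wins" rule and the modifier defaults, using the cgi-bin example above. Wildcard matching is approximated here with Python's fnmatch module; this is an illustration, not the spider's actual code.

import fnmatch

def matches(mask, url):
    # Masks without a leading "http://" behave as if they begin with "*";
    # matching is case-insensitive.
    if not mask.startswith("http://"):
        mask = "*" + mask
    return fnmatch.fnmatchcase(url.lower(), mask.lower())

def applicable_settings(exclusions, url):
    result = None
    for line in exclusions:
        mask, *modifiers = line.split()
        if matches(mask, url):
            # Modifiers left unspecified default to "no".
            settings = {"index": "no", "follow": "no"}
            settings.update(m.split("=") for m in modifiers)
            result = settings  # the last matching exclusion wins
    return result

exclusions = [
    "http://example.com/cgi-bin/*",
    "http://example.com/cgi-bin/content.cgi* index=yes follow=yes",
]
print(applicable_settings(exclusions, "http://example.com/cgi-bin/other.cgi"))
# {'index': 'no', 'follow': 'no'} -- excluded
print(applicable_settings(exclusions, "http://example.com/cgi-bin/content.cgi?p=1"))
# {'index': 'yes', 'follow': 'yes'} -- indexed and followed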
Excluding Pages using Robots.txt
If for some reason you cannot
use the Control Center to exclude pages,
you can use a
robots.txt file
to make the spider ignore certain parts of your site.
This mechanism is more limited than Control Center exclusions,
and so should typically not be used.
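If you do use one, a "robots.txt" file like the following, placed in the top-level directory of your site, asks spiders to skip the listed directories (the directory names here are only placeholders):

User-agent: *
Disallow: /archive/
Disallow: /cgi-bin/

Note that "User-agent: *" applies to every robots.txt-compliant spider that visits your site, not just this one.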
Excluding Part of a Page
You can use special search engine tags to prevent part of a page
from being indexed.
If you want to prevent an entire page from being indexed then see
Excluding Pages Using the Control Center.
To prevent part of a page from being indexed, add the tag
<!-- FreeFind Begin No Index -->
before the section of the page to be ignored, and the tag
<!-- FreeFind End No Index -->
after the section of the page to be ignored, then respider your site.
When the spider notices these tags, it will prevent the text occurring between them
from being included in the index.
Note that the spider will still follow the links on this page to locate the other pages
of your site.
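For example (the text and the link address here are placeholders):

<p>This paragraph will be included in the index.</p>
<!-- FreeFind Begin No Index -->
<p>This sidebar text will not be indexed, but the
<a href="/news.html">link inside it</a> will still be followed.</p>
<!-- FreeFind End No Index -->
<p>This paragraph will be included in the index as well.</p>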
Note that the spider does not use the "noindex" robots meta tag
(<meta name="robots" content="noindex">)
even though it does pay attention to the "nofollow" robots meta tag.
Preventing Links from Being Followed
Since the spider determines the pages of your site by following all the links it can,
preventing it from following links can change which parts of your site get indexed.
The spider uses the standard "nofollow" robots meta tag
(<meta name="robots" content="nofollow">)
although it does not pay attention to the "noindex" robots meta tag.
By default, the spider is very thorough in its link detection and use. In addition to
following ordinary links, it will also try to extract links from any javascript on
your page and follow links which contain query strings (like
"/cgi-bin/doit.cgi?page=1&option=2").
Robust javascript link extraction is essentially impossible, so the javascript link extractor
may make some invalid guesses. If your server receives a lot of "404 page not found" requests
when the spider runs, you may want to consider turning off the javascript link extraction.
All this can be customized.
The sections below outline the various techniques available for
controlling the spider's link-finding process.
Preventing all links in javascript from being followed
Add the tag
<meta name="FreeFind" content="noFollowScript">
to the very start of the first page the spider reads. After the spider processes this tag it
will ignore all javascript in the current page and all subsequent pages.
Preventing links in javascript on a single page from being followed
Add the tag
<meta name="FreeFind" content="noFollowScript">
to the very start of the page.
This causes the spider to ignore all javascript after the tag, for the current page only.
You can also enable javascript link processing for the current page
by using the tag
<meta name="FreeFind" content="followScript">
at the very start of a page.
Preventing links in selected javascript from being followed
Add the tag
<nofollowscript>
before the javascript the spider should ignore, and
</nofollowscript>
after the javascript the spider should ignore.
Note that these tags should not be in the javascript itself.
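For example (the script content here is only a placeholder):

<nofollowscript>
<script type="text/javascript">
// Any links this script produces will not be followed by the spider.
document.write('<a href="/popup.html">open popup</a>');
</script>
</nofollowscript>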
Making the spider ignore your robots meta tags
Add the tag
<meta name="FreeFind" content="noRobotsTag">
to the very start of the first page the spider reads (before any existing robots meta tag!).
After the spider processes this tag it will ignore all robots meta tags in the current page and all subsequent pages.
Preventing all the links in a page from being followed
Add the tag
<meta name="FreeFind" content="nofollow">
to the very start of the page.
This causes the spider to ignore all links in the page, both before and after the tag, for the current page only.
Preventing specific links from being followed
Add the tag
<!-- FreeFind nofollow -->
before the link(s) the spider should ignore, and
<!-- FreeFind end nofollow -->
after the link(s) the spider should ignore.
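For example (the link addresses here are placeholders):

<a href="/products.html">This link will be followed.</a>
<!-- FreeFind nofollow -->
<a href="/printable.html">This link will be ignored by the spider.</a>
<a href="/login.cgi">This one will be ignored too.</a>
<!-- FreeFind end nofollow -->
<a href="/contact.html">This link will be followed again.</a>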
Preventing links with query strings from being followed
Add the tag
<meta name="FreeFind" content="noQueries">
to the very start of the first page the spider reads.
After the spider processes this tag it will ignore all links with query strings.
Stripping the query string off links before following
Add the tag
<meta name="FreeFind" content="stripQueries">
to the very start of the first page the spider reads.
After the spider processes this tag it will remove any query strings from the links it follows.
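The short Python sketch below illustrates what stripping the query string does to a link; it shows the effect only, not the spider's actual code.

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    # Remove the query string, keeping the rest of the link intact.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))

print(strip_query("http://example.com/cgi-bin/doit.cgi?page=1&option=2"))
# prints: http://example.com/cgi-bin/doit.cgi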
Excluding Pages from the Site Map
To prevent a page from being included in the site map, add the tag
<!-- FreeFind No Map -->
to that page then respider your site.
This tag often removes all of the pages linked "beneath" the current page from
the site map as well. If the only link to a page was on the "no mapped" page,
that page will not appear in the site map either.
Excluding Pages from What's New
To prevent a page from being included in the what's new page, add the tag