Too Many Pages Indexed
If the search engine located more pages than you expected, here are the most common reasons why.
TOP QUESTIONS
You are indexing a calendar
Time goes on forever, so most automatic calendar programs will generate an infinite number of pages!
The solution is to prevent the spider from indexing your calendar, or to limit the pages it does index.
This can be done using the standard techniques for preventing parts of your site from being indexed.
For more information on this, read
How to Exclude Pages from Search.
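As a minimal sketch, if your calendar pages all live under one directory (the "/calendar/" path below is just an illustration), a robots.txt file at the root of your site will keep any spider that honors the standard robots exclusion convention out of it:
  User-agent: *
  Disallow: /calendar/
Everything under that path is then skipped, while the rest of your site is indexed normally.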
You are indexing a forum
Forums can generate many more pages than you might expect.
This is because most forum software not only displays user messages,
but also has a rich set of supporting features, such as "reply to", "new thread",
"show user", "edit user", "admin", "register", and so on. The net effect of this is
that most of the pages indexed in a forum are junk pages - pages which you really
don't want to show up in search results.
The solution here is to prevent the spider from indexing your forum,
or to limit the pages it does index to only the messages.
For tips on how to do this, including specific instructions for some
popular forums, read
Searching Forums.
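As a sketch, if your forum software serves its support functions from separate scripts (the phpBB-style paths below are only an illustration, not a recommendation for your particular package), you could disallow just those scripts in robots.txt and leave the message pages indexable:
  User-agent: *
  Disallow: /forum/posting.php
  Disallow: /forum/profile.php
  Disallow: /forum/login.php
The exact paths depend on your forum software; the article linked above covers specific packages.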
You are indexing other dynamically-generated pages
FreeFind can index dynamically-generated pages without a problem.
Sometimes it is too good at this, though, and finds more of these pages than you expected.
This is because dynamic-content generators often generate pages
for support functions, like program administration, etc.
The net effect of this is that some (possibly a lot) of the pages indexed
are "junk" pages - pages which you really don't want to show up in search results.
The solution here is to limit the pages the spider indexes, excluding the
junk pages that you don't want included.
This can be done using the standard techniques for preventing parts
of your site from being indexed.
For more information on this, read
How to Exclude Pages from Search.
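If the junk pages do not share a single path that you can exclude, another standard technique is to have your page generator emit a robots meta tag on the support pages themselves. This is ordinary HTML (whether it is the best fit depends on your generator):
  <meta name="robots" content="noindex">
Place the tag inside the page's <head> section; spiders that honor it will fetch the page but leave it out of the index.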
You are indexing your directory listings in addition to your pages
When given a web address that refers to a directory (and not a specific
file), most web servers will first look in that directory to find a
"welcome" (or "default") file to display (typically index.html).
If a welcome file cannot be located some servers will then automatically
generate a page which lists the files in that directory - a directory
listing.
In general, directory listings are considered a potential security problem
and a bad idea.
If your site has a link to a directory listing, the FreeFind spider will
locate it and start indexing all the web directories on your server.
Since most directory listings have links to sort each list a few
different ways, this generally results in lots and lots of "junk" pages
being indexed and potentially included in search results.
To determine if you are indexing directory listings, you can both look at
the site map and try some searches. Most directory listing pages have a
title like directory of ... or listing of ... or
index of .... By searching for the first two words of these titles
you can usually locate any directory listings you may have.
If you are indexing directory listings, the solution is to track down the
original link(s) to the directory listing in your website and fix them so they refer to
specific pages. You will probably want to do this regardless of the search engine,
to improve the security of your site.
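If your site runs on the Apache web server (an assumption; other servers have their own settings), you can also turn directory listings off entirely with a one-line .htaccess file in the affected directory, provided your host allows it:
  Options -Indexes
With listings disabled, a request for a directory that has no welcome file returns an error page instead of a list of files.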
You are indexing both "www.example.com" and "example.com"
Some folks use "www" links and "non-www" links throughout their site
as if they were the same. Then, to get their site completely indexed,
they add the other website address to the spider's list of
additional starting points. Sometimes this works, but often it results
in spidering twice as many pages as expected.
If this is happening to you, the fix is to standardize on one or the
other; either use "www" in all your site's links or don't. Do not mix and
match. As a bonus, fixing this problem may improve how well your site is indexed
by the big web search engines.
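As a sketch, on an Apache server with mod_rewrite enabled (an assumption about your hosting), a permanent redirect can enforce the "www" form no matter which address a visitor or spider starts from:
  RewriteEngine On
  RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Swap the two host names to standardize on the "non-www" form instead.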
MORE QUESTIONS
You have specified additional spider starting points
This might be a problem for two reasons:
- You've forgotten about them.
  If you set up your search engine some time ago, you may have simply forgotten
  which sites you have instructed it to index.
- You intend them to be single pages, not sites.
  You may not realize it, but each starting point address refers to a site,
  not a single page. Just like your primary account address, the spider will
  index the entire site at each starting point. If you are expecting it to just index
  the starting point page, the spider will probably end up indexing a lot more pages than
  you anticipate.
It is possible to setup an "exclusion" to make the spider treat a starting point like a page
instead of a site. For information on this read
How to Exclude Pages from Search.
You have bad links and your server is not returning the correct error code
Although most servers accurately detect requests for non-existent pages,
some servers either:
- return an error page but don't set the error code, or
- return a page from your site and don't set the error code
If your server is doing the former, you may end up with a few extra pages
corresponding to the bad links in your site. Usually searching for:
page not found
will locate these types of pages. You can also try searching for:
404
and see if any pages are listed.
If your server is doing the latter and you have a certain type
of error in one or more of your links, you can end up with an infinite
number of pages being indexed. The type of bad link which can do this
is one that ends with a slash but actually refers to a page, like:
oops.html/
In both cases fixing the links then reindexing your site will fix the problem.
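To check what your server actually returns, you can request a page you know does not exist and look at the status line of the response (the address below is just an illustration; curl's -I option fetches only the response headers):
  curl -I http://www.example.com/no-such-page.html
A correctly configured server answers with "HTTP/1.1 404 Not Found"; if you see a "200 OK" status on what is clearly an error page, your server is not setting the error code.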