


This search engine is designed for scanning large, text-intensive websites. Installation and maintenance are a little more involved than for other internal search engines, but Xavatoria pays off with speed and flexibility for your visitors.

This script makes use of the Meta keywords and description tags. The list of hits for a particular query is weighted by relevance, and only a set number of links are displayed per page.

Xavatoria allows for powerful but easy-to-use Boolean operators, grouping, case control, and basic wildcard searches. It includes both a setup guide for webmasters and a searching guide for visitors.

In theory, this script should work on Unix, Windows NT, and other servers with Perl 5 installed (maybe even Perl 4). However, since I have access to only a few Unix servers and one NT server, I cannot guarantee that it will work on any specific platform.

Search the Perl Documentation
The large Perl archives (615 files consuming 4,207 kb) make for a good testing ground for CGI scripts. Our demo will eventually allow you to search the same archive with various search engines, which have been modified to output their execution time in addition to hits.

View the Search Code (search.pl)
View the Index Building Code (build.pl)
Download the Code and Instructions (xavatoria.zip)

 
Help File for Xavatoria Indexed Search



 
[TOP]
Why Would I Want to Use Xavatoria?


Complete Webmaster Control
You decide exactly which directories and which file types are scanned, and you can also designate certain key files to appear higher in the rankings. Should every visitor see your new product layout? Should everyone experience that new Java demonstration? Now they will. In addition, Xavatoria keeps a record of all queries to let you know what visitors are really looking for - in their own words.

Multi-Platform Support
Xavatoria operates on Unix and Windows NT servers (and probably others too). The directory conventions and environment variables can be auto-detected or specified, so only a minimum number of settings must be customized.

Keep the Masses Happy
A searchable site is much preferred by visitors, especially those who know what they're looking for. The ability to use highly specific search language to retrieve a sorted list of hits makes Xavatoria the tool of choice for scanning large websites. The rapid, efficient execution will keep both your visitors and sysadmins happy.

 
[TOP]
System Requirements


To run Xavatoria, you'll need a web account that allows custom CGI scripts in the Perl scripting language.

Depending on the power of your server, Xavatoria will begin to slow down excessively when scanning more than five or ten megabytes of text. If you need to scan a larger set, you may wish to try some of the more powerful engines listed at www.cgi-resources.com.

 
[TOP]
Installation and Configuration

Using Xavatoria requires two executable CGI scripts, one data file, and one normal HTML page from which to start searching.

The data file (index.txt) is created by the index building script, build.pl. The build script needs to know where the new site index file should be placed, and where the files to be searched live on your system. Below is a description of these variables - these descriptions also appear in the comments of the script.

The location on your system of the new index file - this file must be writable.

	$Index_File = "/u2/home/xav/search/index.txt";


$baseurl is the base URL that corresponds to the files in your base directory, $basedir. Use absolute paths as shown, not relative paths. Do NOT include trailing slashes.

	$baseurl = "http://www.xav.com";
	$basedir = "/usr/www/users/xav";


$extensions holds the file extensions that will be included in searches. It's best to leave out ones like ".log" or ".cgi". Note the special "\." delimiters, and their occurrence at both the beginning and end of the set.

	$extensions = "\.html\.htm\.stm\.ztml\.shtml\.";
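Since Perl turns "\." into a plain dot inside double quotes, $extensions ends up as one dot-delimited string, so the build script can test a file with a simple substring check. Here is a minimal sketch of how that test likely works - the indexable() helper is an illustration, not the script's actual code:

```perl
# Dot-delimited extension list, as configured above.
# The "\." sequences interpolate to plain dots: ".html.htm.stm.ztml.shtml."
my $extensions = "\.html\.htm\.stm\.ztml\.shtml\.";

# Hypothetical helper: should this file be indexed?
sub indexable {
    my ($filename) = @_;
    my ($ext) = $filename =~ /\.([^.\/]+)$/;    # extension after the last dot
    return 0 unless defined $ext;
    return index($extensions, ".$ext.") >= 0;   # the dots delimit each entry
}

print indexable("guide.html") ? "yes" : "no", "\n";   # prints "yes"
print indexable("error.log")  ? "yes" : "no", "\n";   # prints "no"
```

The leading and trailing dots matter: without them, a stray extension like ".tml" would falsely match inside ".html".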


Below are the files or directories that you do NOT want searched. Note that each entry ends with one blank space, and that directories do not include trailing slashes. Also note that we use the ".=" (append) operator instead of "=".

	$DMZ .= "/usr/www/users/xav/secure ";
	$DMZ .= "/usr/www/users/xav/cgi-bin ";
	$DMZ .= "/usr/www/users/xav/counters ";


Placing a directory in the DMZ will prevent searching in both that directory and all internal sub-directories.
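Because each DMZ entry is a space-terminated absolute path, excluding sub-directories comes down to a prefix match. A minimal sketch of how that check might look (the excluded() helper is hypothetical, not the script's exact code):

```perl
# Space-terminated exclusion list, as configured above.
my $DMZ = "";
$DMZ .= "/usr/www/users/xav/secure ";
$DMZ .= "/usr/www/users/xav/cgi-bin ";

# Hypothetical helper: is this path inside a DMZ directory?
sub excluded {
    my ($path) = @_;
    for my $zone (split ' ', $DMZ) {
        return 1 if index($path, $zone) == 0;   # prefix match covers sub-directories too
    }
    return 0;
}

print excluded("/usr/www/users/xav/secure/keys.html") ? "skip" : "scan", "\n";   # prints "skip"
print excluded("/usr/www/users/xav/index.html")       ? "skip" : "scan", "\n";   # prints "scan"
```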

That takes care of files that we don't want visitors to see. Now we list the files that your visitors really should see (why not show off your best work, eh?). While the search results remain honest as to whether or not terms were found, these files will have their numerical ranking multiplied by the CRANK FACTOR, which should be an integer between two and twenty. Only files can go here, not directories. The same rules from above apply...

	$CRANK_FACTOR = 18;
	$CRANK .= "/usr/www/users/xav/links.ztml ";
	$CRANK .= "/usr/www/users/xav/scripts/index.html ";
	$CRANK .= "/usr/www/users/xav/clients.html ";


All occurrences of a search term count as one point. The occurrence of a term in the URL, Title, Meta keywords, or Meta description can have added weight (equivalent to a multiplier per hit). Enter the multipliers in the array below - the defaults are (4, 10, 10, 4). If this makes no sense to you, just ignore it and leave the defaults as they are - they work pretty well. Note that this will give extra weight to those pages having properly formatted Title and Meta tags, even if they contain the same basic information (kinda like the real search engines).

	$Filename_Rank = 4;
	$Title_Rank = 10;
	$Keyword_Rank = 10;
	$Description_Rank = 4;
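The arithmetic behind the ranking can be worked through with a small example. This sketch uses the default multipliers above and a made-up set of per-page hit counts; it illustrates the scoring scheme described in the text, not the script's actual code:

```perl
# Default weights from the configuration above.
my ($Filename_Rank, $Title_Rank, $Keyword_Rank, $Description_Rank) = (4, 10, 10, 4);
my $CRANK_FACTOR = 18;

# Hypothetical hit counts for one search term on one page.
my %hits = (body => 3, filename => 1, title => 1, keywords => 2, description => 0);

# Every occurrence is one point; URL/Title/Meta occurrences are multiplied.
my $score = $hits{body}
          + $hits{filename}    * $Filename_Rank
          + $hits{title}       * $Title_Rank
          + $hits{keywords}    * $Keyword_Rank
          + $hits{description} * $Description_Rank;

# A page listed in $CRANK gets its total multiplied again.
my $in_crank = 1;
$score *= $CRANK_FACTOR if $in_crank;

print "$score\n";   # 3 + 1*4 + 1*10 + 2*10 + 0*4 = 37, then 37 * 18 = 666
```

This is why a page with a well-written Title and Meta keywords outranks a page that only mentions the term in its body text.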


That takes care of build.pl. The only other thing to do is upload it to the server in ASCII format and make it executable (chmod 755 build.pl).

Now we move on to the search script, search.pl. Once again, the variable descriptions which follow are repeated in the comments of the script to remind you.

Location of your index file - identical to the variable in the previous script.

	$Index_File = "/u2/home/xav/search/index.txt";


Now specify words that are generally ignored in a search query. Note that users can still search for these words by using a special character before them, e.g. "+search", or by including them in a quoted statement.

	$Ignore .= " what how who which when where do you find site get and ";
	$Ignore .= " or if not a the for an it of from by the one two to he ";
	$Ignore .= " most all about i me search is are be been with why ";
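The space padding around each word lets the script test the ignore list with a simple substring lookup, while a leading "+" rescues a term from being dropped. A minimal sketch of that filtering, with an abbreviated ignore list and a hypothetical keep_terms() helper:

```perl
# Space-padded ignore list, as configured above (abbreviated for the example).
my $Ignore = " what how the and a to of is ";

# Hypothetical filter: drop ignored terms unless forced with a leading "+".
sub keep_terms {
    my @kept;
    for my $raw (@_) {
        my $term = $raw;   # copy; elements of @_ may be read-only literals
        if ($term =~ s/^\+//) { push @kept, $term; next; }   # "+the" forces "the"
        push @kept, $term unless index($Ignore, " \L$term\E ") >= 0;
    }
    return @kept;
}

print join(",", keep_terms("how", "+the", "perl", "docs")), "\n";   # prints "the,perl,docs"
```

The "\L...\E" lowercases the term before the lookup, so "How" and "how" are treated alike.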


Now we specify the URL to your main search page with tips and so on.

	$Search_Page = "http://www.xav.com/scripts/xavatoria/search.html";


How many links should the viewer see per page?

	$Hits_Per_Page = 10;
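The paging itself comes down to slicing the sorted result list. A minimal sketch, using a made-up list of 23 hits and a hypothetical page_slice() helper:

```perl
my $Hits_Per_Page = 10;

# Hypothetical sorted result list (URLs), highest score first.
my @hits = map { "http://www.xav.com/page$_.html" } 1 .. 23;

# Page numbers start at 1; return just the links for the requested page.
sub page_slice {
    my ($page) = @_;
    my $start = ($page - 1) * $Hits_Per_Page;
    return () if $start > $#hits;                  # past the end of the results
    my $end = $start + $Hits_Per_Page - 1;
    $end = $#hits if $end > $#hits;                # last page may be short
    return @hits[$start .. $end];
}

my @page3 = page_slice(3);
print scalar(@page3), "\n";   # prints "3" -- results 21 through 23 of 23
```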

On every page we want a hyperlink to the main web page, so users don't have to hit the back button fifty times to get back home:

	($Link_URL,$Link_Title) = ("http://www.xav.com","Fluid Dynamics Main Page");

Now you can completely customize the HTML for the top of the search results page. Ours is pretty generic and should work if you aren't too creative. Edit only between the lines containing "EOM":

	$Header = <<EOM;
	<HTML>
	<HEAD>
	<TITLE>Search Results</TITLE>
	</HEAD>
	<BODY BGCOLOR="#FFFFFF">
	<H2><TT>Xavatoria Search Results</TT></H2>
	EOM


Now specify the HTML to give searchers who can't seem to find anything. Usually this should include yet another link back to the tips page. If you're a business that is not overly annoyed by people sending you email, it'd be nice to have a mailto link for them to request information from a human.

	$No_Documents_Found = <<EOM;
	<BLOCKQUOTE>
	<B>Unfortunately, we didn't find any documents which matched your 
	search terms. You may want to visit our <A HREF="search.html">search 
	tips</A> page to better refine your queries.<BR>
	<BR>If all else fails, write to us at <A HREF="mailto:noc\@xav.com">
	noc\@xav.com</A> and we'll assist in your search.</B>
	</BLOCKQUOTE>
	EOM


Finally, enter the location of your summaries file. It holds information about who searched for what, what they found, and so on. This is pretty good feedback on your visitors. Make this one writable (chmod 777 summaries.html):

	$summary_file = "/u2/home/xav/summaries.html";


That takes care of search.pl. Upload it to your site as ASCII text and make it executable (chmod 755 search.pl). While you're at it, upload the starter index.txt file and the search.html page so that visitors can start searching. You may want to customize the search examples to match your theme.

Finally, you have to use build.pl to create the site's index.txt file. Make sure that your starter index is writable (chmod 777 index.txt) and go to the URL of the build.pl script. Just visiting the URL will build the site index, and then you should be set. You'll want to rebuild the index after major changes in content.

 
[TOP]
Trouble-shooting

If you get a "malformed header" or "premature end of script headers" message, it may be because the script was transferred as a binary file at some point (which scrambles the hidden end-of-line characters and confuses the server - always transfer scripts in ASCII format). If you open the file with Pico, create and delete a line, and then save it, the problem usually goes away.

Other causes include the location (or absence) of Perl on your system, which is reflected in the first line of the code. My scripts are written as if Perl lived at "/usr/bin/perl". Another common location is "/usr/local/bin/perl". To find out where they've hidden the Perl executable, either ask your administrator or type "whereis perl" from Telnet.

The most common problem is to not have permissions set correctly, resulting in the 403 Permission Denied message. Make sure the script is readable and executable by everyone (chmod 755 search.pl). Also make sure that the summaries file is writable (chmod 777 summaries.html).

Other error messages should be automatically generated by the scripts, and should be pretty self-explanatory. If you have read and followed the instructions here and Xavatoria still does not work, there isn't much hope that it ever will. Even so, Fluid Dynamics will provide free, limited support via email for this script. Send requests for assistance to noc@xav.com. Please include the relevant URLs in your message, and cut & paste the telnet response to the "perl -w search.pl" and "perl -w build.pl" commands if possible. Mail may sometimes wait two or three months before being responded to. Note that we will not support or respond to those operating sites whose material offends (adult sites & the like).

Custom installation is available at a reasonable rate; custom coding currently runs $40/hour and we have an it-works-or-it's-free guarantee. Installing this script should take less than an hour.

 
[TOP]
Copyright Information and Notification of Updated Versions


Xavatoria is strict freeware. You may use it for your own personal or commercial purposes, but you may not charge to host or install it. Everything has to be free. Even though you can use it for free, it is still copyrighted by Fluid Dynamics and cannot be distributed without permission. Distribution requests should be directed to noc@xav.com.

Because I have a full-time job at Burger King, I have enough money to pay rent and am able to distribute scripts for free. Sometimes, to make sure I have enough money to pay rent, I don't buy food. And so I sit in my one-bedroom apartment with my roommates and create search engines on an empty stomach (it builds character, you know). I suspect that some who use and enjoy Xavatoria will be from companies which typically pay for such services. If those people could pass a dollar or two my way, or even some canned goods, I would be very grateful. Enough said.

If you'd like a notification via email when an updated version is released, write to noc@xav.com and say so. The data structures for this particular script were written to allow a (thus far unreleased) robot script to search remote sites and add them to the database, without any need for changing the rest of the code. This beast will probably be done some time in July. To help drive this and other new updates, please send any suggestions and bug reports to the address above.

 
[TOP]
Credits and Stuff


Many thanks are due to the Altavista search engine. Though all of the code for Xavatoria is more or less original, the Altavista search engine provided an excellent guide as to what features and limitations would be useful and necessary. The friendlier output features of Altavista, Yahoo, and Infoseek were combined to make an interface that is familiar to the average web user.

Thanks also go to NW Nexus for partially subsidizing the development and providing an NT web account for testing. Also kudos to Tom Christiansen for creating the four megabyte Perl 5 documentation which was used both to test the large-archive scanning features and to come up with tighter search algorithms. Jeff Carnahan wrote the code for extracting "Last Modified" times, and Linda White of Three Rivers Free-Net provided lots & lots of good suggestions.

© 1997, Fluid Dynamics