Site Search

Help users find what they're looking for

by Read Write Tools
Abstract
This page describes two command line tools that parse website documents and assemble a full-text website index. The output is stored in a format that the rwt-search Web component can consume. Look-ahead word autofill guides users as they search locally hosted documents.

Motivation

Full text search is the bread and butter of search engines. But relying on external search engines to help website visitors find what they are looking for is less than ideal.

The site search feature of Read Write Serve gives visitors a superior experience. It allows webmasters to fine-tune the algorithm used to weight words and direct visitors to better pages. It uses autofill word look-ahead to guide user searches, circumventing the all too common and disappointing experience of "no results found".

Site Search Command Line Tools

The two command line tools for site search are semwords, which performs semantic parsing, and sitewords, which assembles the parsed results into a full site index.

SEMWORDS

The semwords utility parses a BLUEPHRASE file, creating a weighted list of word occurrences.

For the most efficient use of resources, this utility should be invoked by a build tool that is sensitive to file modification timestamps, so that it triggers a semwords scan only when a document is changed. (The Read Write Tools prorenata builder has this capability.)
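The timestamp check that such a build tool performs can be sketched in a few lines of Python. This is only an illustration of the idea, not how prorenata works: the function names `needs_rescan` and `scan_if_changed` are hypothetical, and the sketch assumes semwords is on the PATH and accepts the two positional arguments shown in its usage line.

```python
import subprocess
from pathlib import Path


def needs_rescan(source: Path, target: Path) -> bool:
    """True when the .words target is missing or older than its source document."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def scan_if_changed(source: Path, target: Path) -> bool:
    """Run semwords only when the target is stale; return True if a scan ran."""
    if not needs_rescan(source, target):
        return False
    # Positional form from the documented usage: semwords [inputfile] [outputfile]
    subprocess.run(["semwords", str(source), str(target)], check=True)
    return True
```

Calling `scan_if_changed` for every document in a build loop leaves unchanged documents untouched, so only edited pages pay the cost of a new parse.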

Weighting is performed by semantic context. For example, the words contained in a document's <title> are typically weighted very heavily. Words that are contained in <h1>, <h2>, etc. are weighted moderately. Words that are subordinate to <menu> might be weighted zero. Any semantax that is not explicitly weighted is assumed to have a weight of 1.
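To make the weighting idea concrete, here is a minimal Python sketch of a weighted word tally. It is not the semwords implementation: the weight values, the `tally` function, and the flat list of (semantax, text) pairs are all assumptions made for illustration, and real documents nest elements in ways this sketch ignores. The only rules taken from the text are that unlisted semantax default to a weight of 1 and that a weight of zero suppresses the words entirely.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

DEFAULT_WEIGHT = 1  # any semantax not explicitly weighted counts as 1

# Hypothetical weights, chosen only to illustrate heavy, moderate, and zero.
WEIGHTS: Dict[str, int] = {"title": 10, "h1": 5, "h2": 4, "menu": 0}


def tally(fragments: Iterable[Tuple[str, str]], weights: Dict[str, int]) -> Dict[str, int]:
    """Sum a weight per word occurrence, based on the semantax that contains it."""
    totals: Counter = Counter()
    for semantax, text in fragments:
        weight = weights.get(semantax, DEFAULT_WEIGHT)
        if weight == 0:
            continue  # zero-weighted context contributes nothing
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            totals[word] += weight
    return dict(totals)


fragments = [
    ("title", "Site Search"),
    ("h1", "Search tools"),
    ("menu", "Home Search"),
    ("p", "search the site"),
]
result = tally(fragments, WEIGHTS)
```

With these assumed weights, "search" accumulates 16 (title, heading, and paragraph occurrences), "site" accumulates 11, and "home" never appears because it occurs only inside the zero-weighted menu.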

The software is invoked from the command line like this:

[user@host]# semwords [inputfile] [outputfile] [options]

Where [inputfile] is the BLUEPHRASE file to parse.

Where [outputfile] is the .words file for the parsed results.

Where the [options] are:

  • --stopwords Filename that contains words that should not be tabulated. Place each stop word on a separate line.
  • --weights Filename that contains weights for semantax. Each line holds an integer weight followed by a semantax, separated by whitespace.
  • --numbers How to treat words that contain numerals:
    • 'never' never keep a word that contains even a single digit [0-9]
    • 'ordinals' keep ordinals (20th, 21st), times (9am, 10am), and other words with one or two digits, but discard any word that has more than two digits
    • 'mixed' keep any word that has a mixture of digits and letters, but discard pure numbers
    • 'pure' keep any pure numbers, but discard words with a mixture of digits and letters
    • 'always' keep anything that has any number of digits, mixed or pure
  • --minlength Minimum word length to keep, default is 1.
  • --keepwords Filename that contains words that should be tabulated, overriding other rules.
  • --hostpath Hostname and path prefix for the filename, like "https://www.example.com/".
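The five --numbers policies can be read as a simple keep-or-discard predicate. The Python sketch below encodes one plausible reading of the option descriptions; it is not taken from the semwords source. The function name `keep_word` is hypothetical, tokens are assumed to contain only letters and digits, and the 'ordinals' rule is interpreted as "at most two digits", which may differ from the tool's exact behavior for short pure numbers.

```python
DIGITS = "0123456789"


def keep_word(word: str, policy: str) -> bool:
    """Decide whether a word survives the given --numbers policy."""
    digit_count = sum(1 for ch in word if ch in DIGITS)
    if digit_count == 0:
        return True  # words without digits are unaffected by this option
    is_pure = digit_count == len(word)
    if policy == "never":
        return False
    if policy == "ordinals":
        return digit_count <= 2  # assumed reading: 20th and 9am pass, 2020 does not
    if policy == "mixed":
        return not is_pure
    if policy == "pure":
        return is_pure
    if policy == "always":
        return True
    raise ValueError(f"unknown --numbers policy: {policy}")
```

For example, under this reading `keep_word("21st", "ordinals")` keeps the word, while `keep_word("2020", "mixed")` discards it as a pure number.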

SITEWORDS

The sitewords utility collates the .words files created by semwords into a site index. The resultant file can be used by the rwt-search Web component.

The software is invoked from the command line like this:

[user@host]# sitewords [inputdir] [outputfile] [options]

Where [inputdir] is the directory containing the .words files to collate. This may contain a nested hierarchy of subdirectories.

Where [outputfile] is the sitewords file containing all of the words, weights, and hyperlinks needed to conduct a site search.

Where there is only one option:

--sort Specifies how the index is sorted:

  • ternary sort the words in Ternary Trie order: the first word is the one in the alphabetical middle of the dataset, and the words that follow alternate between one place lower and one place higher in alphabetical order, until the first word (aaa) and last word (zzz) are reached.
  • alpha sort the words alphabetically.
  • weight sort the words in descending weighted order.
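The ternary ordering can be read as a middle-out walk over the alphabetized word list. The Python sketch below follows that reading literally; the function name `middle_out` is hypothetical, and the actual sitewords output may treat even-length lists or ties differently.

```python
from typing import Iterable, List


def middle_out(words: Iterable[str]) -> List[str]:
    """Order words starting at the alphabetical middle, then alternating outward."""
    ordered = sorted(words)
    if not ordered:
        return []
    mid = len(ordered) // 2
    result = [ordered[mid]]
    lower, higher = mid - 1, mid + 1
    while lower >= 0 or higher < len(ordered):
        if lower >= 0:
            result.append(ordered[lower])  # one place lower
            lower -= 1
        if higher < len(ordered):
            result.append(ordered[higher])  # one place higher
            higher += 1
    return result


sample = middle_out(["eee", "aaa", "ddd", "bbb", "ccc"])
```

For this five-word sample the walk starts at "ccc", then visits "bbb" and "ddd", and ends with "aaa" and "eee".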

Site Search Web Component

The rwt-search Web component may be used to conduct a site search. It uses the sitewords file produced by the CLI utility to build a data structure called a Ternary Trie. This structure provides immediate access to partial word lookups, allowing for efficient autofill guidance as the user types.
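For readers unfamiliar with the structure, the following Python sketch shows a generic ternary search trie with prefix completion. It illustrates the general technique only and is not the rwt-search code: the class and method names are hypothetical, and the component's handling of weights and hyperlinks is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    ch: str
    lo: Optional["Node"] = None   # characters that sort before ch
    eq: Optional["Node"] = None   # next character of words passing through ch
    hi: Optional["Node"] = None   # characters that sort after ch
    is_word: bool = False


class TernaryTrie:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, word: str) -> None:
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node: Optional[Node], word: str, i: int) -> Node:
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def complete(self, prefix: str) -> List[str]:
        """Return all stored words that start with prefix, alphabetically."""
        if not prefix:
            return []
        node, i = self.root, 0
        while node is not None:
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(prefix):
                    break  # node now holds the last prefix character
                node = node.eq
        if node is None:
            return []
        results: List[str] = [prefix] if node.is_word else []
        self._collect(node.eq, prefix, results)
        return results

    def _collect(self, node: Optional[Node], prefix: str, results: List[str]) -> None:
        if node is None:
            return
        self._collect(node.lo, prefix, results)
        if node.is_word:
            results.append(prefix + node.ch)
        self._collect(node.eq, prefix + node.ch, results)
        self._collect(node.hi, prefix, results)


trie = TernaryTrie()
for w in ["search", "seam", "sea", "site", "semantic"]:
    trie.insert(w)
suggestions = trie.complete("sea")
```

Each keystroke narrows the walk to a single subtree, which is why partial-word lookups stay cheap even for a large site index.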

The rwt-search Web component is available via NPM. See the separate page for installation and setup instructions.

License and availability

The SEMWORDS and SITEWORDS software utilities are distributed with the RWSERVE HTTP/2 Web Server. They are not available separately.

SEMWORDS and SITEWORDS Software License Agreement

Copyright © 2020 Read Write Tools.

  1. This Software License Agreement ("Agreement") is a legal contract between you and Read Write Tools ("RWT"). The "Materials" subject to this Agreement include the software app "SEMWORDS and SITEWORDS" and its associated documentation.
  2. By installing, copying or otherwise using the Materials, you agree to abide by the terms of this Agreement. If you choose not to agree with these provisions, you must uninstall and delete all copies of the Materials.
  3. The Materials are protected by United States copyright law, patent law, and trade secret law, as well as international treaties on intellectual property rights. The Materials are licensed, not sold to you, and can only be used in accordance with the terms of this Agreement. RWT is and remains the owner of all titles, rights and interests in the Materials, and RWT reserves all rights not specifically granted under this Agreement.
  4. Subject to the terms of this Agreement, RWT hereby grants to you a limited, non-exclusive license to use the Materials subject to the following conditions:
    • You are allowed to install the Materials on more than one computer or device, as long as the Materials will not be used on more than one computer or device simultaneously. You may make additional copies of the Materials for backup purposes only.
    • You may not distribute, publish, make publicly available, sub-license, sell, rent, or lease the Materials.
    • You may not extract, decompile, or reverse engineer any binary or source code included in the Materials. Your license to use the Materials is limited to its use in its original packaged format, and does not include permission to extract or use parts on a separate basis.
  5. THE MATERIALS ARE PROVIDED BY READ WRITE TOOLS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL READ WRITE TOOLS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  6. Portions of the Material are covered by third-party software license agreements. Those agreements have their own terms and conditions, which may include restrictions and limitations on intellectual property use, distribution, publication, and modification that differ from this Agreement. Those agreements are:
    1. Node.js License
    2. V8 License
    3. nghttp2 License
    4. Joezone License
    5. Blue Phrase Processor Software License Agreement

    The terms and conditions of those third-party agreements apply to the respective intellectual property covered by those software license agreements, and do not extend to any Material owned by Read Write Tools.

  7. This license is effective until terminated. Without prejudice to any other rights, RWT may terminate your right to use the Materials if you fail to comply with the terms of this Agreement. In such event, you shall uninstall and delete all copies of the Materials.
  8. This Agreement is governed by and interpreted in accordance with the laws of the State of California. If for any reason a court of competent jurisdiction finds any provision of the Agreement to be unenforceable, that provision will be enforced to the maximum extent possible to effectuate the intent of the parties and the remainder of the Agreement shall continue in full force and effect.
