Full-Text Search Specialty Data Store User's Guide
Chapter 4: Setting Up Verity Functions
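This chapter describes the setup required before you can write queries that use certain Verity functionality. It covers enabling query-by-example, summarization, and clustering; setting up columns for use in a sort specification; using filters on text that contains tags; creating a custom thesaurus; and creating topics.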
The style.prm file specifies additional data to include in the text indexes to support phrases in query-by-example specifications, summarization, and clustering.
Note: The text indexes only need additional data to support phrases in the query-by-example specification of the like operator. If you use a document in the query-by-example specification, additional data is not required.
You can enable these features for all text indexes by editing the master style.prm file, or you can enable them for an individual text index by editing its style.prm file. Both methods are described below.
To use phrases in a query-by-example specification and to use clustering, you must enable the generation of document feature vectors at indexing time. To do this, uncomment the following line in the style.prm file:
$define DOC-FEATURES "TF"
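As an illustration only, the following Python sketch performs the same edit on the text of a style.prm file. The function name uncomment_define is hypothetical, and the sketch assumes the commented line begins with "#$define" exactly as in the excerpts shown in this chapter. It is meant for a parameter such as DOC-FEATURES that has a single commented line; DOC-SUMMARIES has several alternatives, of which only one should be uncommented.

```python
import re


def uncomment_define(prm_text, name):
    """Remove the leading '#' from '#$define <name> ...' lines.

    Hypothetical helper: assumes the commented form is exactly '#$define'.
    Ordinary comment lines (starting with '# ') are left untouched.
    """
    pattern = re.compile(
        r"^#(\$define\s+" + re.escape(name) + r"(?=\s))", re.MULTILINE
    )
    return pattern.sub(r"\1", prm_text)


sample = '# Feature vectors\n#$define DOC-FEATURES "TF"\n'
print(uncomment_define(sample, "DOC-FEATURES"))
```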
To configure the Full-Text Search engine for summarization, uncomment one of the following lines that starts with "#$define" in the style.prm file:
# The example below stores the best three sentences of
# the document, but not more than 255 bytes.
#$define DOC-SUMMARIES "XS MaxSents 3 MaxBytes 255"
# The example below stores the first four sentences of
# the document, but not more than 255 bytes.
#$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 255"
# The example below stores the first 150 bytes of
# the document, with whitespace compressed.
#$define DOC-SUMMARIES "LB MaxBytes 150"
Each of those lines reflects a different level of summarization. You can specify how many bytes of data you want the Full-Text Search engine to display by altering the numbers at the ends of these lines. For example, if you want no more than 233 bytes of data summarized, edit the line to read:
$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 233"
The maximum number of bytes displayed is 255. Any number greater than that is truncated to 255.
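The following Python sketch shows one way to build a DOC-SUMMARIES line while respecting the 255-byte limit. The function name summary_define is hypothetical; the summary types (XS, LS, LB) and keywords (MaxSents, MaxBytes) are the ones shown in the style.prm excerpt above. The clamp only mirrors the truncation the engine applies, so that the file states the value actually used.

```python
MAX_SUMMARY_BYTES = 255  # limit stated in this chapter


def summary_define(kind, max_bytes, max_sents=None):
    """Return a '$define DOC-SUMMARIES' line for style.prm (hypothetical helper)."""
    if kind not in ("XS", "LS", "LB"):
        raise ValueError("unknown summary type: %r" % kind)
    parts = [kind]
    if max_sents is not None:
        parts += ["MaxSents", str(max_sents)]
    # Values above 255 would be truncated by the engine anyway.
    parts += ["MaxBytes", str(min(max_bytes, MAX_SUMMARY_BYTES))]
    return '$define DOC-SUMMARIES "%s"' % " ".join(parts)


print(summary_define("LS", 233, max_sents=4))
print(summary_define("LB", 400))
```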
The master style.prm file is located in:
$SYBASE/sds/text/verity/common/style
It contains the default Full-Text Search engine style parameters. Edit this file to configure the Full-Text Search engine so that all tables on which you create text indexes allow clustering and literal text in your query-by-example specifications, or summarization. Uncomment the applicable lines as described above.
Note: If you have existing text indexes, you must re-create the text index with these features enabled as described in "Editing Individual style.prm Files" below.
Editing Individual style.prm Files
Perform the following steps to configure the Full-Text Search engine so that an individual text index allows clustering and literal text in your query-by-example specifications, or summarization:
1. Create the text index using sp_create_text_index. For example:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "empty", "copy"
Note: If the text index already exists, omit this step. You do not need to create the text index again.
2. Drop the text index using sp_drop_text_index. For example, to drop the text index created in step 1, enter:
sp_drop_text_index "blurbs.i_blurbs"
3. Edit the style.prm file for the text index. It is located in:
$SYBASE/sds/text/collections/db.owner.index/style
where db.owner.index is the database, the database owner, and the index created with sp_create_text_index. For example, if you create a text index called i_blurbs on the pubs2 database, the full path to the directory that contains the file is:
$SYBASE/sds/text/collections/pubs2.dbo.i_blurbs/style
For example, to enable clustering, uncomment the following line so that it reads:
$define DOC-FEATURES "TF"
4. Re-create the text index using sp_create_text_index. For example:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "", "copy"
Before you can sort by specific columns, you must modify the style.vgw and style.ufl files. (For information on including a column in a sort specification, see "Using the sort_by Column to Specify a Sort Order".) Both files are in the directory:
$SYBASE/sds/text/collections/db.owner.index/style
where db.owner.index is the database, the database owner, and the index created using sp_create_text_index. For example, if you created a text index called i_blurbs on the pubs2 database, the full path to those files would be similar to:
$SYBASE/sds/text/collections/pubs2.dbo.i_blurbs/style
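The path pattern above can be assembled programmatically. The following Python sketch is illustrative only: the function name style_directory is hypothetical, and /opt/sybase stands in for whatever $SYBASE is set to at your site.

```python
import posixpath


def style_directory(sybase_root, database, owner, index):
    """Build the style directory path for a text index (hypothetical helper)."""
    collection = "%s.%s.%s" % (database, owner, index)
    return posixpath.join(
        sybase_root, "sds", "text", "collections", collection, "style"
    )


# '/opt/sybase' is an assumed value of $SYBASE, used only for illustration.
print(style_directory("/opt/sybase", "pubs2", "dbo", "i_blurbs"))
```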
To edit the style.vgw and style.ufl files, follow these steps:
1. Drop the text index. For example, to add definitions for the copy column in the blurbs table, use the following command to drop the text index:
sp_drop_text_index i_blurbs
2. Edit the style.vgw file. Following this line:
dda: "SybaseTextServer"
add an entry for the column you are defining. The syntax is:
table: DOCUMENTS
{
copy: fcolumn_number copy_fcolumn_number
}
where column_number is the number of the column you are defining. Column numbers start with 0: to sort by the first column, specify "f0"; to sort by the second column, specify "f1"; to sort by the third column, specify "f2"; and so on.
For example, to define the first column in a table, the syntax is:
table: DOCUMENTS
{
copy: f0 copy_f0
}
Then, your style.vgw file will be similar to this:
#
# Sybase Text Server Gateway
#
$control: 1
gateway:
{
dda: "SybaseTextServer"
table: DOCUMENTS
{
copy: f0 copy_f0
}
}
3. Edit the style.ufl file. Add a definition for the column you are defining. The syntax is:
data-table: fts
{
fixwidth: copy_fcolumn_number precision datatype
}
Column numbers start with 0, as in the style.vgw file. For example, to add a definition for the first column of a table, with a precision of 4 and a datatype of date, enter:
data-table: fts
{
fixwidth: copy_f0 4 date
}
Similarly, to add a definition for the second column of a table, with a precision of 10 and a datatype of character, enter:
data-table: fts
{
fixwidth: copy_f1 10 text
}
4. Re-create the text index that you dropped in step 1, using sp_create_text_index.
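Because the style.vgw and style.ufl entries must agree on the column number, it can help to generate both from the same value. The following Python sketch does that; the function names vgw_entry and ufl_entry are hypothetical, and the output follows the layout of the examples in this section.

```python
def vgw_entry(column_number):
    """Return the style.vgw entry for a zero-based column number."""
    if column_number < 0:
        raise ValueError("column numbers start with 0")
    return "table: DOCUMENTS\n{\ncopy: f%d copy_f%d\n}" % (
        column_number,
        column_number,
    )


def ufl_entry(column_number, precision, datatype):
    """Return the matching style.ufl definition for the same column."""
    if column_number < 0:
        raise ValueError("column numbers start with 0")
    return "data-table: fts\n{\nfixwidth: copy_f%d %d %s\n}" % (
        column_number,
        precision,
        datatype,
    )


# First column, precision 4, datatype date (the example used in this section).
print(vgw_entry(0))
print(ufl_entry(0, 4, "date"))
```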
To perform accurate searches on documents that contain tags (such as HTML or PostScript), the text index must use a filter to strip out the tags. The Standard Full-Text Search engine provides filtering for SGML and HTML documents. The Enhanced Full-Text Search engine provides filters for a variety of document types (Microsoft Word, FrameMaker, WordPerfect, SGML, HTML, and so on).
When you create the text index to use a filter, the data for each type of tag in the document is placed into its own document zone. For example, if you have a tag called "chapter," all chapter names are placed into one document zone. You can issue a query that searches the entire document, or that searches only for data in the "chapter" zone (for more information, see "in").
To create a text index that uses a filter, modify the style.dft file for that text index. To edit the style.dft file, follow these steps:
1. Create the text index using sp_create_text_index. For example:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "empty", "copy"
WARNING! You should specify only one column in the text index when the text index uses a filter.
2. Drop the text index using sp_drop_text_index. For example:
sp_drop_text_index i_blurbs
3. Edit the style.dft file. It is located in:
$SYBASE/sds/text/collections/db.owner.index/style
where db.owner.index is the database, the database owner, and the index created using sp_create_text_index. For example, if you created a text index called i_blurbs on the pubs2 database, the full path to the directory that contains the style.dft file would be similar to:
$SYBASE/sds/text/collections/pubs2.dbo.i_blurbs/style
Following this line:
field: f0
add syntax to use a filter.
With the Standard Full-Text Search engine, use the following syntax for SGML documents:
/filter="zone -nocharmap"
and the following syntax for HTML documents:
/filter="zone -html -nocharmap"
With the Enhanced Full-Text Search engine, use the following syntax:
/filter="universal"
For example, your style.dft file for an SGML document in the Standard version will look like this:
$control: 1
dft:
{
field: f0
/filter="zone -nocharmap"
field: f1
field: f2
.
.
field: f15
}
Your style.dft file for an SGML document in the Enhanced version will look like this:
$control: 1
dft:
{
field: f0
/filter="universal"
field: f1
field: f2
.
.
field: f15
}
4. Re-create the text index using sp_create_text_index. For example:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "", "copy"
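The style.dft edit amounts to inserting one /filter line directly after the "field: f0" line. The following Python sketch shows that insertion on the text of a style.dft file. The function name add_filter is hypothetical, and the sketch assumes the field line appears on a line of its own, as in the examples in this section. It does not check whether a filter line is already present.

```python
def add_filter(dft_text, filter_spec, field="f0"):
    """Insert a /filter line after the given field line (hypothetical helper)."""
    out = []
    for line in dft_text.splitlines():
        out.append(line)
        if line.strip() == "field: %s" % field:
            out.append('/filter="%s"' % filter_spec)
    return "\n".join(out) + "\n"


original = "$control: 1\ndft:\n{\nfield: f0\nfield: f1\n}\n"
print(add_filter(original, "zone -html -nocharmap"))
```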
The Verity thesaurus operator expands a search to include the specified word and its synonyms (for information on using the thesaurus operator, see "thesaurus"). In the Enhanced version of the Full-Text Search engine, you can create a custom thesaurus that contains application-specific synonyms to use in place of the default thesaurus.
For example, the default English language thesaurus contains these words as synonyms for "money": "cash," "currency," "lucre," "wampum," and "greenbacks." You can create a custom thesaurus that contains a different set of synonyms for "money"; for example: "bid," "tokens," "credit," "asset," and "verbal offer."
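To create a custom thesaurus, you optionally examine the default thesaurus, create a control file that defines your synonyms, run the mksyd utility to build the custom thesaurus, and then replace the default thesaurus with the custom thesaurus.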
For more information on "Custom Thesaurus Support" and the mksyd utility, see the Verity Web site at:
http://www.verity.com
In the Enhanced version of the Full-Text Search engine, two sample files in the $SYBASE/sds/text/sample/scripts directory illustrate how to set up and use a custom thesaurus.
A control file contains all the synonym definitions for a thesaurus. To examine the default thesaurus, create its control file using the mksyd utility. Use the syntax:
mksyd -dump -syd $SYBASE/sds/text/verity/common/vdkLanguage/vdk20.syd -f work_location/control_file.ctl
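where vdkLanguage is the value of the vdkLanguage configuration parameter, work_location is the directory in which to write the control file, and control_file.ctl is the name of the control file to create.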
Examine the control file (control_file.ctl) that it creates to view the default synonym lists.
Create a control file that contains the new synonyms for your custom thesaurus. The control file is an ASCII text file in a structured format. Using a text editor (such as vi or emacs), either edit a copy of the control file created from the default thesaurus or create a new control file.
The control file contains synonym list definitions in a synonyms: statement. For example, the following is a control file named colors.ctl:
$control: 1
synonyms:
{
list: "red, ruby, scarlet, fuchsia,\
magenta"
list: "electric blue <or> azure"
/keys = "lapis"
}
$$
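The synonyms: statement includes one or more list: entries, each giving a quoted set of synonyms, and an optional /keys modifier on a list, as shown in the example above.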
Note: If you use emacs to build a synonym list and any of your lists go beyond one line, turn off auto-fill mode. If you separate your list into multiple lines, include a backslash (\) at the end of each line so that the lines are treated as one list.
For more complex examples of control files, see the Verity Web site.
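For simple synonym lists, a control file can also be generated rather than typed. The following Python sketch writes each list on a single line, which avoids the backslash continuation described in the note above. The function name thesaurus_control_file is hypothetical, and the sketch covers only plain comma-separated lists, not the <or> form or the /keys modifier.

```python
def thesaurus_control_file(synonym_lists):
    """Return control file text for plain synonym lists (hypothetical helper)."""
    lines = ["$control: 1", "synonyms:", "{"]
    for words in synonym_lists:
        for word in words:
            if '"' in word:
                raise ValueError("double quotes are not handled: %r" % word)
        # One list per line, so no continuation backslash is needed.
        lines.append('list: "%s"' % ", ".join(words))
    lines += ["}", "$$"]
    return "\n".join(lines) + "\n"


print(thesaurus_control_file([["red", "ruby", "scarlet", "fuchsia", "magenta"]]))
```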
The mksyd utility creates the custom thesaurus using a control file as input. It is located in:
$SYBASE/sds/text/verity/bin
Run, or define an alias to run, mksyd from this bin directory. Create your custom thesaurus in any work directory.
The mksyd syntax for creating a custom thesaurus is:
mksyd -f control_file.ctl -syd custom_thesaurus.syd
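where control_file.ctl is the control file that mksyd reads as input, and custom_thesaurus.syd is the custom thesaurus file that mksyd creates.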
For example, to execute the mksyd utility reading the sample control file defined above, and directing output to a work directory, use the syntax:
mksyd -f /usr/u/sybase/dba/thesaurus/colors.ctl -syd /usr/u/sybase/dba/thesaurus/custom.syd
The default thesaurus named vdk20.syd is located in:
$SYBASE/sds/text/verity/common/vdkLanguage
where vdkLanguage is the value of the vdkLanguage configuration parameter (for example, the English directory is $SYBASE/sds/text/verity/common/english0). Each application and user reading from this location at runtime uses this thesaurus. To replace it with your custom thesaurus, follow these steps:
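1. Move the default thesaurus to a backup file. For example:
mv /sybase/sds/text/verity/common/english0/vdk20.syd default.syd
2. Copy the custom thesaurus into place under the default thesaurus name. For example:
cp custom.syd /sybase/sds/text/verity/common/english0/vdk20.syd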
Queries using the thesaurus operator will now use the custom thesaurus.
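The same backup-then-copy sequence can be scripted. The following Python sketch is illustrative only: the function name install_custom_thesaurus and its parameters are hypothetical, and it refuses to overwrite an existing backup so that the original default thesaurus is not lost on a second run.

```python
import shutil
from pathlib import Path


def install_custom_thesaurus(custom_syd, language_dir, backup_path):
    """Back up vdk20.syd, then copy the custom thesaurus over it.

    Hypothetical helper. language_dir is the vdkLanguage directory,
    for example .../verity/common/english0.
    """
    target = Path(language_dir) / "vdk20.syd"
    backup = Path(backup_path)
    if backup.exists():
        raise FileExistsError("backup already exists: %s" % backup)
    shutil.move(str(target), str(backup))        # equivalent of the mv step
    shutil.copyfile(str(custom_syd), str(target))  # equivalent of the cp step
    return backup
```

To return to the default thesaurus later, copy the backup file over vdk20.syd in the same way.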
A topic is a grouping of information related to a concept or subject area. With topic definitions in place, a user can perform searches on the topic instead of having to write queries with complex syntax.
A user creates topics, which can be combinations of words and phrases, Verity operators and modifiers, and weight values. Any user can then query the topic.
Before you create topics, determine your application requirements, and establish standards for naming conventions and for the location of your topic outline files, topic set directories, and knowledge base map.
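To implement topics, create a topic outline file, create a topic set directory with the mktopics utility, create a knowledge base map, set the knowledge_base configuration parameter, and then execute queries against the defined topics.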
For more information about outline formats, operator precedence rules, and the mktopics utility, see the Verity Web site:
http://www.verity.com.
See also the Verity document Search '97 Introduction to Topics.
Sample files that illustrate the topics feature are in the $SYBASE/sds/text/sample/scripts directory.
A topic outline file specifies all the combinations of words and phrases, Verity operators and modifiers, and weight values that the search engine uses when you issue a query using the topic. The outline file is an ASCII text file in a structured format.
For example, the following outline file defines the topic "saint-bernard":
$control: 1
saint-bernard <accrue>
*0.80 "Saint Bernard"
*0.80 "St. Bernard"
* "working dogs"
* "large dogs"
* "European breeds"
$$
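When you issue a query specifying the topic "saint-bernard", the Full-Text Search engine searches for documents that contain one or more of the phrases "Saint Bernard", "St. Bernard", "working dogs", "large dogs", and "European breeds", combines the matches with the accrue operator, and applies a weight of 0.80 to the phrases "Saint Bernard" and "St. Bernard".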
This example is a very basic topic definition; an outline can express more complex relationships. For complex examples of outline files, see the Verity Web site.
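A flat outline like the "saint-bernard" example can be generated from a list of phrases and weights. The following Python sketch reproduces the layout shown above. The function name topic_outline is hypothetical, and the sketch handles only a single topic with one level of phrases.

```python
def topic_outline(topic, operator, children):
    """Return outline file text for one flat topic (hypothetical helper).

    children is a list of (phrase, weight) pairs; weight may be None.
    """
    lines = ["$control: 1", "%s <%s>" % (topic, operator)]
    for phrase, weight in children:
        prefix = "*" if weight is None else "*%.2f" % weight
        lines.append('%s "%s"' % (prefix, phrase))
    lines.append("$$")
    return "\n".join(lines) + "\n"


print(topic_outline(
    "saint-bernard",
    "accrue",
    [("Saint Bernard", 0.80), ("St. Bernard", 0.80), ("working dogs", None)],
))
```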
Note: In Windows NT, you can use the graphical user interface of the Verity topicEDITOR product to create topic outlines. It is available from Verity. If you use topicEDITOR, it automatically creates a topic set directory, and you can go to "Creating a Knowledge Base Map" to continue setting up your topics.
Use the mktopics utility to create and populate a topic set directory. It is located in:
$SYBASE/sds/text/verity/bin
Run, or define an alias to run, mktopics from this bin directory. You can create a topic set directory or directories in any work directory.
The mktopics syntax is:
mktopics -outline outline_file.otl -topicset topic_set_directory
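where outline_file.otl is the topic outline file that mktopics reads as input, and topic_set_directory is the topic set directory that mktopics creates and populates.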
For example, to execute the mktopics utility reading the saint-bernard.otl file defined above, and directing output to a work directory, use the syntax:
mktopics -outline /usr/u/sybase/topic_outlines/saint-bernard.otl -topicset /usr/u/sybase/topic_sets/saint-bernard_topic
Creating a Knowledge Base Map
A knowledge base map specifies the locations of one or more topic set directories. Create an ASCII knowledge base map file that defines the fully qualified directory paths to your topic sets.
For example, the following knowledge base map file illustrates how you can list multiple knowledge bases in the map. The first entry identifies the topic set directory created with mktopics above.
$control: 1
kbases:
{
kb:
/kb-path = /usr/u/sybase/topic_sets/saint-bernard_topic
kb:
/kb-path = /usr/u/sybase/topic_sets/another_topic
}
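When you maintain several topic sets, the map file can be generated from a list of directories. The following Python sketch follows the layout of the example above, assuming the header uses the same "$control: 1" form as the other control files in this chapter. The function name knowledge_base_map is hypothetical.

```python
def knowledge_base_map(topic_set_dirs):
    """Return knowledge base map text listing each topic set directory."""
    lines = ["$control: 1", "kbases:", "{"]
    for path in topic_set_dirs:
        lines += ["kb:", "/kb-path = %s" % path]
    lines.append("}")
    return "\n".join(lines) + "\n"


print(knowledge_base_map([
    "/usr/u/sybase/topic_sets/saint-bernard_topic",
    "/usr/u/sybase/topic_sets/another_topic",
]))
```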
Set the knowledge_base configuration parameter to point to the location of the knowledge base map. For example:
sp_text_configure KRAZYKAT, 'knowledge_base', '/usr/u/sybase/topic_sets/sample_text_topics.kbm'
The knowledge_base configuration parameter is static, and you must restart the Full-Text Search engine for the definition to take effect.
You can now execute queries using the defined topic instead of a complex query. For example, before you create the "saint-bernard" topic, you would have to use the following syntax to find documents that contain any of the phrases in the topic definition:
...where i.index_any = "<accrue> ([80]Saint Bernard, [80]St. Bernard, working dogs, large dogs, European breeds)"
After you create the topic "saint-bernard", you can use this syntax:
...where i.index_any = "<topic>saint-bernard"
or:
...where i.index_any = "saint bernard"
Note: If you enter a word in a query expression, the Full-Text Search engine tries to match it with a topic name. If you enter a phrase in a query expression, the Full-Text Search engine replaces spaces with hyphens (-), and then tries to match it with a topic name. For example, the Full-Text Search engine matches "saint bernard" with the topic "saint-bernard".
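The matching rule in the note can be pictured with the following Python sketch. It models only the space-to-hyphen substitution described above; the function name resolve_topic is hypothetical, and any other normalization the engine may apply (for example, case handling) is not represented.

```python
def resolve_topic(expression, topic_names):
    """Return the topic name a word or phrase would match, or None."""
    # Replace runs of whitespace with single hyphens, per the note above.
    candidate = "-".join(expression.split())
    return candidate if candidate in topic_names else None


topics = {"saint-bernard"}
print(resolve_topic("saint bernard", topics))
print(resolve_topic("great dane", topics))
```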
See the sample_text_topics.sql file for examples of using topics in queries.
If the knowledge_base configuration parameter specifies a knowledge base map file that does not exist, the Full-Text Search engine will not be able to start a session with Verity, and the server will not start. If the map file exists but contains invalid entries, Verity issues warning messages at start-up time.
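Because a missing map file prevents the engine from starting, it can be worth checking the file before you set the knowledge_base parameter and restart. The following Python sketch is one possible pre-check. The function name missing_topic_sets is hypothetical, and it treats only nonexistent /kb-path directories as invalid entries, which is an assumption rather than the full set of checks that Verity performs.

```python
from pathlib import Path


def missing_topic_sets(map_path):
    """Return the /kb-path directories in a map file that do not exist.

    Raises FileNotFoundError if the map file itself is missing.
    """
    map_file = Path(map_path)
    if not map_file.is_file():
        raise FileNotFoundError("knowledge base map not found: %s" % map_file)
    missing = []
    for line in map_file.read_text().splitlines():
        line = line.strip()
        if line.startswith("/kb-path"):
            # Everything after the first '=' is the directory path.
            path = line.partition("=")[2].strip()
            if not Path(path).is_dir():
                missing.append(path)
    return missing
```

An empty list means every listed topic set directory exists; it does not guarantee that the directories contain valid topic sets.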