Full-Text Specialty Data Store User's Guide
Chapter 4 Setting Up Verity Functions
This chapter describes the setup required before you can write queries with certain Verity functionality.
The style.prm file specifies additional data to include in the text indexes to support the following functionality:
Query-by-example - Retrieves documents that are similar to a phrase (see "like" for more information).
The text indexes only need additional data to support phrases in the query-by-example specification of the like operator. If you use a document in the query-by-example specification, additional data is not required.
Summarization - returns summaries of documents rather than entire documents (see "Using the summary Column to Summarize Documents" for more information).
Clustering - groups documents in result sets by subtopic (see "Using Pseudo Columns to Request Clustered Result Sets" for more information). Clustering is available only with the Enhanced Full-Text Search engine.
You can enable these features for all text indexes by editing the master style.prm file, or you can enable them for an individual text index by editing its style.prm file. Both methods are described below.
To use phrases in a query-by-example specification and to use clustering, you must enable the generation of document feature vectors at indexing time. To do this, uncomment the following line in the style.prm file:
$define DOC-FEATURES "TF"
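As a rough illustration of the uncommenting step, the following Python sketch removes the leading "#" from a commented-out $define line in the text of a style.prm file. The function name uncomment_define is hypothetical, and the sketch assumes the commented line starts with "#$define" in the first column, as in the examples in this chapter.

```python
import re


def uncomment_define(prm_text: str, name: str) -> str:
    """Return prm_text with '#$define NAME ...' lines changed to '$define NAME ...'.

    Hypothetical helper: it assumes the '#' is the first character on the line.
    """
    pattern = re.compile(r"^#(\$define\s+" + re.escape(name) + r"\b)", re.MULTILINE)
    return pattern.sub(r"\1", prm_text)


sample = '# Document feature vectors\n#$define DOC-FEATURES "TF"\n'
print(uncomment_define(sample, "DOC-FEATURES"))
```

The helper uncomments every matching line, so it suits DOC-FEATURES, which has a single definition. It is not suitable as written for DOC-SUMMARIES, where only one of several alternatives should be enabled.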
To configure the Full-Text Search engine for summarization, uncomment one of the following lines that starts with "#$define" in the style.prm file:
# The example below stores the best three sentences of
# the document, but not more than 255 bytes.
#$define DOC-SUMMARIES "XS MaxSents 3 MaxBytes 255"
# The example below stores the first four sentences of
# the document, but not more than 255 bytes.
#$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 255"
# The example below stores the first 150 bytes of
# the document, with whitespace compressed.
#$define DOC-SUMMARIES "LB MaxBytes 150"
Each of those lines reflects a different level of summarization. You can specify how many bytes of data you want the Full-Text Search engine to display by altering the numbers at the ends of these lines. For example, if you want only the first 233 bytes of data summarized, edit the line to read:
$define DOC-SUMMARIES "LS MaxSents 4 MaxBytes 233"
The maximum number of bytes displayed is 255. Any number greater than that is truncated to 255.
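The 255-byte limit can be applied before the line is written to style.prm. The sketch below is a minimal Python helper (the name summary_define is hypothetical) that builds a DOC-SUMMARIES line from the three summary types shown above and clamps MaxBytes to 255, so the file shows the value the engine would actually use.

```python
MAX_SUMMARY_BYTES = 255  # limit stated in this chapter


def summary_define(kind, max_bytes, max_sents=None):
    """Build a '$define DOC-SUMMARIES' line.

    kind is one of the summary types shown in style.prm: "XS", "LS", or "LB".
    max_bytes is clamped to MAX_SUMMARY_BYTES.
    """
    if kind not in ("XS", "LS", "LB"):
        raise ValueError("unknown summary type: %r" % kind)
    parts = [kind]
    if max_sents is not None:
        parts.append("MaxSents %d" % max_sents)
    parts.append("MaxBytes %d" % min(max_bytes, MAX_SUMMARY_BYTES))
    return '$define DOC-SUMMARIES "%s"' % " ".join(parts)


print(summary_define("LS", 233, max_sents=4))
```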
The master style.prm file is located in:
$SYBASE/$SYBASE_FTS/verity/common/style
It contains the default Full-Text Search engine style parameters. Edit this file to configure the Full-Text Search engine so that all tables on which you create text indexes allow clustering and literal text in your query-by-example specifications, or summarization. Uncomment the applicable lines as described above.
If you have existing text indexes, you must re-create the text index with these features enabled as described in Editing Individual style.prm Files below.
Perform the following steps to configure the Full-Text Search engine so that the individual text index allows clustering and literal text in your query-by-example specifications, or summarization:
Create the text index using sp_create_text_index. Use the word "empty" in the option_string parameter so that the style.prm file is created for the text index, but the Verity collections are not populated with data. For example, if you are enabling clustering for the copy column of the blurbs table, use the following syntax:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "empty", "copy"
If the text index already exists, omit this step. You do not need to create the text index again.
Use sp_drop_text_index to drop the text index associated with the style.prm file you are editing.
For example, to drop the text index created in step 1, enter:
sp_drop_text_index "blurbs.i_blurbs"
Edit the style.prm file that exists for the text index. The style.prm file for an existing collection is located in:
$SYBASE/$SYBASE_FTS/collections/db.owner.index/style
where db.owner.index is the database, the database owner, and the index created with sp_create_text_index. For example, if you create a text index called i_blurbs on the pubs2 database, the full path to these files is:
$SYBASE/$SYBASE_FTS/collections/pubs2.dbo.i_blurbs/style
Uncomment the applicable lines as described above.
For example, to enable clustering, uncomment the following line:
$define DOC-FEATURES "TF"
Re-create the text index you dropped in step 2. For example, to re-create the i_blurbs text index, enter:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "", "copy"
Before you can sort by specific columns, you must modify the style.vgw and style.ufl files. (For information on including a column in a sort specification, see "Using the sort_by Column to Specify a Sort Order".) Both files are in the directory:
$SYBASE/$SYBASE_FTS/collections/db.owner.index/style
where db.owner.index is the database, the database owner, and the index created using sp_create_text_index. For example, if you created a text index called i_blurbs on the pubs2 database, the full path to those files would be similar to:
$SYBASE/$SYBASE_FTS/collections/pubs2.dbo.i_blurbs/style
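The style directory path follows a fixed pattern, so it can be computed rather than typed. This Python sketch assumes that $SYBASE is an absolute directory and that $SYBASE_FTS names a directory beneath it, as the paths in this chapter suggest. The function name and the sample values "/opt/sybase" and "FTS" are placeholders, not values from a real installation.

```python
import posixpath


def collection_style_dir(sybase, sybase_fts, db, owner, index):
    """Return $SYBASE/$SYBASE_FTS/collections/db.owner.index/style."""
    collection = "%s.%s.%s" % (db, owner, index)
    return posixpath.join(sybase, sybase_fts, "collections", collection, "style")


# Placeholder installation values; in practice read them from the environment,
# for example os.environ["SYBASE"] and os.environ["SYBASE_FTS"].
print(collection_style_dir("/opt/sybase", "FTS", "pubs2", "dbo", "i_blurbs"))
```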
To edit the style.vgw and style.ufl files, follow these steps:
Drop the text index that contains the columns for which you are adding definitions. (Dropping the text index does not drop the collection directory.)
For example, to add definitions for the copy column in the blurbs table, use the following command to drop the text index:
sp_drop_text_index "blurbs.i_blurbs"
Edit the style.vgw file. Following this line:
dda: "SybaseTextServer"
add an entry for the column you are defining. The syntax is:
table: DOCUMENTS
{
copy: fcolumn_number copy_fcolumn_number
}
where column_number is the number of the column you are defining. Column numbers start with 0; if you want the first column to be sorted, specify "f0"; to sort the second column, specify "f1"; to sort the third column, specify "f2", and so on.
For example, to define the first column in a table, the syntax is:
table: DOCUMENTS
{
copy: f0 copy_f0
}
Then, your style.vgw file will be similar to this:
#
# Sybase Text Server Gateway
#
$control: 1
gateway:
{
dda: "SybaseTextServer"
table: DOCUMENTS
{
copy: f0 copy_f0
}
}
Edit the style.ufl file by adding the column definition for a data table named fts. The syntax is:
data-table: fts
{
fixwidth: copy_fcolumn_number precision datatype
}
Column numbers start with 0; if you want the first column to be sorted, specify "f0"; to sort the second column, specify "f1"; to sort the third column, specify "f2", and so on. For example, to add a definition for the first column of a table, with a precision of 4, and a datatype of date, enter:
data-table: fts
{
fixwidth: copy_f0 4 date
}
Similarly, to add a definition for the second column of a table, with a precision of 10 and a datatype of character, enter:
data-table: fts
{
fixwidth: copy_f1 10 text
}
Re-create the index, using sp_create_text_index.
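The style.vgw and style.ufl entries described in the steps above differ only in the column number, precision, and datatype, so they can be generated. The following Python sketch uses hypothetical helper names and reproduces the entry layout shown in this chapter. It does not edit the files themselves.

```python
def vgw_entry(column):
    """Return the style.vgw entry for a zero-based column number."""
    if column < 0:
        raise ValueError("column numbers start with 0")
    return "table: DOCUMENTS\n{\ncopy: f%d copy_f%d\n}" % (column, column)


def ufl_entry(column, precision, datatype):
    """Return the style.ufl entry for a zero-based column number."""
    if column < 0:
        raise ValueError("column numbers start with 0")
    return "data-table: fts\n{\nfixwidth: copy_f%d %d %s\n}" % (column, precision, datatype)


# First column, precision 4, datatype date (the example used in the text).
print(vgw_entry(0))
print(ufl_entry(0, 4, "date"))
```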
To perform accurate searches on documents that contain tags (such as HTML or PostScript), the text index must use a filter to strip out the tags. The Enhanced Full-Text Search engine provides filters for a variety of document types (Microsoft Word, FrameMaker, WordPerfect, SGML, HTML, and others).
When you create the text index to use a filter, the data for each type of tag in the document is placed into its own document zone. For example, if you have a tag called "chapter," all chapter names are placed into one document zone. You can issue a query that searches the entire document, or that searches only for data in the "chapter" zone (for more information, see "in").
To create a text index that uses a filter, modify the style.dft file for that text index. To edit the style.dft file, follow these steps:
Create the text index using sp_create_text_index. Use the word "empty" in the option_string parameter so that the style.dft file is created for the text index, but the Verity collections are not populated with data. For example, to create a text index for the copy column of the blurbs table, use the following syntax:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "empty", "copy"
Drop the text index that you create in step 1. This drops the text index, but not the style.dft file. For example, use the following command to drop the i_blurbs text index:
sp_drop_text_index "blurbs.i_blurbs"
Edit the style.dft file. The style.dft file is in the directory:
$SYBASE/$SYBASE_FTS/collections/db.owner.index/style
where db.owner.index is the database, the database owner, and the index created using sp_create_text_index. For example, if you created a text index called i_blurbs on the pubs2 database, the full path to the style.dft file would be similar to:
$SYBASE/$SYBASE_FTS/collections/pubs2.dbo.i_blurbs/style
Following this line:
field: f0
add syntax to use a filter.
Use the following syntax:
For SGML documents, use:
/filter="zone -nocharmap"
For HTML documents, use:
/filter="zone -html -nocharmap"
With Enhanced Full-Text Search engine, use the following syntax for all document types:
/filter="universal"
For example, a style.dft file for an SGML document that uses the zone filter will look like this:
$control: 1
dft:
{
field: f0
/filter="zone -nocharmap"
field: f1
field: f2
.
.
field: f15
}
Your style.dft file for an SGML document in the Enhanced version will look like this:
$control: 1
dft:
{
field: f0
/filter="universal"
field: f1
field: f2
.
.
field: f15
}
Use getsend to load the database with document data. getsend takes the following arguments: database, table, column, and row ID. Insert a null value for the row ID for each row of text you wish to insert. getsend must insert into an image column for filtering to work. For more information on getsend, refer to the README.TXT file and the getsend.c file in the $SYBASE/$SYBASE_FTS/sample/source directory.
Re-create the index, using sp_create_text_index. For example:
sp_create_text_index "KRAZYKAT", "i_blurbs", "blurbs", "", "copy"
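The style.dft edit in the steps above amounts to inserting one /filter line after the "field: f0" line. A minimal Python sketch of that edit, operating on the file contents as a string, is shown here. The function name add_filter is hypothetical, and the sketch does not check whether a filter line is already present.

```python
def add_filter(dft_text, filter_value, field="f0"):
    """Insert '/filter="<filter_value>"' after the first 'field: <field>' line."""
    out = []
    inserted = False
    for line in dft_text.splitlines():
        out.append(line)
        if not inserted and line.strip() == "field: %s" % field:
            out.append('/filter="%s"' % filter_value)
            inserted = True
    if not inserted:
        raise ValueError("no 'field: %s' line found" % field)
    return "\n".join(out) + "\n"


dft = "$control: 1\ndft:\n{\nfield: f0\nfield: f1\n}\n"
print(add_filter(dft, "zone -html -nocharmap"))
```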
The Verity thesaurus operator expands a search to include the specified word and its synonyms (for information on using the thesaurus operator, see "thesaurus"). In the Enhanced version of the Full-Text Search engine, you can create a custom thesaurus that contains application-specific synonyms to use in place of the default thesaurus.
For example, the default English language thesaurus contains these words as synonyms for "money": "cash," "currency," "lucre," "wampum," and "greenbacks." You can create a custom thesaurus that contains a different set of synonyms for "money," such as "bid," "tokens," "credit," "asset," and "verbal offer."
To create a custom thesaurus, follow these steps:
Make a list of the synonyms that you will use with your application. It may help to examine the default thesaurus (see "Examining the Default Thesaurus (Optional)").
Create a control file that contains the synonyms you are defining for your custom thesaurus (see "Creating the Control File").
Create the custom thesaurus using the mksyd utility (see "Creating the Thesaurus"). This uses the control file as input.
Replace the default thesaurus with your custom thesaurus (see "Replacing the Default Thesaurus with the Custom Thesaurus").
For more information on "Custom Thesaurus Support" and the mksyd utility, see the Verity Web site.
In the Enhanced version of Full-Text Search engine, two sample files illustrate how to set up and use a custom thesaurus:
sample_text_thesaurus.ctl is a sample control file
sample_text_thesaurus.sql issues queries against the custom thesaurus defined in the sample control file
These files are in the $SYBASE/$SYBASE_FTS/sample/scripts directory.
A control file contains all the synonym definitions for a thesaurus. To examine the default thesaurus, create its control file using the mksyd utility. Use the syntax:
mksyd -dump -syd $SYBASE/$SYBASE_FTS/verity/common/vdkLanguage/vdk20.syd -f work_location/control_file.ctl
where:
vdkLanguage - is the value of the vdkLanguage configuration parameter (for example, "english")
work_location - is the directory where you want to place the control file
control_file - is the name of the control file you are creating from the default thesaurus
Examine the control file (control_file.ctl) that it creates to view the default synonym lists.
Create a control file that contains the new synonyms for your custom thesaurus. The control file is an ASCII text file in a structured format. Using a text editor (such as vi or emacs), either:
Edit the control file from the default thesaurus and add new synonyms to the existing thesaurus (see "Examining the Default Thesaurus (Optional)"), or
Create a new control file that includes only your synonyms
The control file contains synonym list definitions in a synonyms: statement. For example, the following is a control file named colors.ctl:
$control: 1
synonyms:
{
list: "red, ruby, scarlet, fuchsia,\
magenta"
list: "electric blue <or> azure"
/keys = "lapis"
}
$$
The synonyms: statement includes:
list: keywords that specify the start of a synonym list. The synonyms in the list are either in query form or in a list of words or phrases separated by commas.
Each list: can optionally have a /keys modifier that specifies one or more keys separated by commas. In the example above, no keys are specified in the first "list". This means the list is found when the thesaurus is queried for the word "red," "ruby," "scarlet," "fuchsia," or "magenta." The second "list" uses the /keys modifier to specify one key. This means the words or phrases in the list will satisfy a query only when you specify <thesaurus>lapis.
If you use emacs to build a synonym list and any of your lists go beyond one line, turn off auto-fill mode. If you separate your list into multiple lines, include a backslash (\) at the end of each line so that the lines are treated as one list.
For more complex examples of control files, see the Verity Web site.
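When the synonym lists live in application data, the control file can be generated instead of typed. The Python sketch below (render_control_file is a hypothetical name) follows the layout of the colors.ctl sample. It writes each list on a single line, which avoids the backslash continuation issue. Placing several keys in one quoted, comma-separated string is an assumption drawn from the description of /keys above, not something the sample demonstrates.

```python
def render_control_file(lists):
    """Render a thesaurus control file.

    lists is a sequence of (synonyms, keys) pairs. synonyms is either a
    query-form string or a list of words and phrases; keys is a list of key
    words or None.
    """
    out = ["$control: 1", "synonyms:", "{"]
    for synonyms, keys in lists:
        body = synonyms if isinstance(synonyms, str) else ", ".join(synonyms)
        out.append('list: "%s"' % body)
        if keys:
            # Assumption: multiple keys share one quoted string.
            out.append('/keys = "%s"' % ", ".join(keys))
    out.extend(["}", "$$"])
    return "\n".join(out) + "\n"


colors = [
    (["red", "ruby", "scarlet", "fuchsia", "magenta"], None),
    ("electric blue <or> azure", ["lapis"]),
]
print(render_control_file(colors))
```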
The mksyd utility creates the custom thesaurus using a control file as input. It is located in:
$SYBASE/$SYBASE_FTS/verity/bin
Run, or define an alias to run, mksyd from this bin directory. Create your custom thesaurus in any work directory.
The mksyd syntax for creating a custom thesaurus is:
mksyd -f control_file.ctl -syd custom_thesaurus.syd
where:
control_file - is the name of the control file you create in Creating the Control File above
custom_thesaurus - is the name of the custom thesaurus you are creating
For example, to execute the mksyd utility reading the sample control file defined above, and directing output to a work directory, use the syntax:
mksyd -f /usr/u/sybase/dba/thesaurus/colors.ctl -syd /usr/u/sybase/dba/thesaurus/custom.syd
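If you drive mksyd from a script, it can help to build the argument lists in one place. This Python sketch only assembles the two command forms shown in this chapter (dumping the default thesaurus and building a custom one) and does not run them. The function names are hypothetical, and mksyd is assumed to be reachable from the verity/bin directory or through your PATH.

```python
def mksyd_dump_cmd(default_syd, control_file):
    """Arguments for dumping an existing thesaurus to a control file."""
    return ["mksyd", "-dump", "-syd", default_syd, "-f", control_file]


def mksyd_build_cmd(control_file, custom_syd):
    """Arguments for building a custom thesaurus from a control file."""
    return ["mksyd", "-f", control_file, "-syd", custom_syd]


cmd = mksyd_build_cmd(
    "/usr/u/sybase/dba/thesaurus/colors.ctl",
    "/usr/u/sybase/dba/thesaurus/custom.syd",
)
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True) after "import subprocess".
```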
The default thesaurus named vdk20.syd is located in:
$SYBASE/$SYBASE_FTS/verity/common/vdkLanguage
where vdkLanguage is the value of the vdkLanguage configuration parameter (for example, the English directory is $SYBASE/$SYBASE_FTS/verity/common/english). Each application and user reading from this location at runtime uses this thesaurus. To replace it with your custom thesaurus, follow these steps:
Back up the default thesaurus before replacing it with the custom thesaurus. For example:
mv $SYBASE/$SYBASE_FTS/verity/common/english/vdk20.syd default.syd
Replace the vdk20.syd file with your custom thesaurus. For example:
cp custom.syd $SYBASE/$SYBASE_FTS/verity/common/english/vdk20.syd
Restart your Full-Text Search engine; no configuration file changes are required. The thesaurus is read from this location when the Full-Text Search engine is started, not when a query is executed.
Queries using the thesaurus operator will now use the custom thesaurus.
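The backup and replace steps above can also be scripted. The following Python sketch is one possible approach, not the documented procedure. It keeps the backup in the language directory rather than the current directory, and it refuses to overwrite an existing backup so that a second run cannot replace the saved default with a custom file. The function name and the default.syd backup name are illustrative. You still need to restart the Full-Text Search engine afterward.

```python
import shutil
from pathlib import Path


def replace_thesaurus(language_dir, custom_syd, backup_name="default.syd"):
    """Back up vdk20.syd (once) and copy custom_syd over it.

    Returns the path of the backup file.
    """
    language_dir = Path(language_dir)
    active = language_dir / "vdk20.syd"
    backup = language_dir / backup_name
    if not backup.exists():
        shutil.copy2(active, backup)  # preserve the shipped thesaurus
    shutil.copy2(custom_syd, active)  # install the custom thesaurus
    return backup
```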
This section provides a condensed overview of Verity Topics. Topics are discussed in detail in Chapter 8, "Verity Topics."
A TOPIC® is a grouping of information related to a concept or subject area. With topic definitions in place, a user can perform searches on the topic instead of having to write queries with complex syntax.
The user creates topics which can be combinations of words and phrases, Verity operators and modifiers, and weight values. Then, any user can query the topic.
Before you create topics, determine your application requirements, and establish standards for naming conventions and for the location of the following:
Outline files - contain the topic definitions. Each topic has its own outline file.
Topic set directories - contain the compiled topics. Each topic has its own topic set directory.
Knowledge base map file - contains pointers to the topic set directories.
To implement topics, perform the following steps:
Create one or more outline input files to define your topics (see "Creating an Outline File"). Each outline file is used to populate one topic set.
Create and populate a topic set directory, using the mktopics utility (see "Creating a Topic Set Directory"). Each topic set directory is populated based on one topic outline input file.
Create a knowledge base map, specifying the locations of one or more topic set directories (see "Creating a Knowledge Base Map").
Set the knowledge_base configuration parameter to point to the location of the knowledge base map (see "Defining the Location of the Knowledge Base Map").
Execute queries against defined topics.
The following sample files illustrate the topics feature:
sample_text_topics.otl is a sample outline file
sample_text_topics.kbm is a sample knowledge base map
sample_text_topics.sql issues queries using defined topics
These files are in the $SYBASE/$SYBASE_FTS/sample/scripts directory.
A topic outline file specifies all the combinations of words and phrases, Verity operators and modifiers, and weight values that the search engine uses when you issue a query using the topic. The outline file is an ASCII text file in a structured format.
For example, the following outline file defines the topic "saint-bernard":
$control: 1
saint-bernard <accrue>
*0.80 "Saint Bernard"
*0.80 "St. Bernard"
* "working dogs"
* "large dogs"
* "European breeds"
$$
When you issue a query specifying the topic "saint-bernard", the Full-Text Search engine:
Returns documents that contain one or more of the following phrases: "Saint Bernard," "St. Bernard," "working dogs," "large dogs," and "European breeds"
Scores documents that contain the phrase "Saint Bernard" or "St. Bernard" higher than documents that contain the phrase "working dogs," "large dogs," or "European breeds"
This example is a very basic topic definition. An outline can introduce more complex relationships by using:
Multiple levels of subtopics
Combinations of Verity operators (this example uses accrue)
Verity modifiers
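For a single-level topic like "saint-bernard", the outline file can be produced from a list of phrases and optional weights. The Python sketch below (render_outline is a hypothetical name) assumes one entry per line and handles only the basic shape of the example above. It does not cover subtopics or modifiers.

```python
def render_outline(topic, operator, children):
    """Render a one-level topic outline.

    children is a sequence of (phrase, weight) pairs; weight is a float
    such as 0.80, or None for an unweighted entry.
    """
    lines = ["$control: 1", "%s <%s>" % (topic, operator)]
    for phrase, weight in children:
        prefix = "*" if weight is None else "*%.2f" % weight
        lines.append('%s "%s"' % (prefix, phrase))
    lines.append("$$")
    return "\n".join(lines) + "\n"


print(render_outline(
    "saint-bernard",
    "accrue",
    [("Saint Bernard", 0.80), ("St. Bernard", 0.80), ("working dogs", None)],
))
```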
In Windows NT, you can use the graphical user interface of the Verity Intelligent Classifier product to create topic outlines. It is available from Verity. If you use Intelligent Classifier, it automatically creates a topic set directory, and you can go to "Creating a Knowledge Base Map" to continue setting up your topics.
Use the mktopics utility to create and populate a topic set directory. It is located in:
$SYBASE/$SYBASE_FTS/verity/bin
Run, or define an alias to run, mktopics from this bin directory. You can create a topic set directory or directories in any work directory.
The mktopics syntax is:
mktopics -outline outline_file.otl -topicset topic_set_directory
where:
outline_file - is the name of the outline file you create in "Creating an Outline File"
topic_set_directory - is the name of the topic set directory you are creating
For example, to execute the mktopics utility reading the saint-bernard.otl file defined above, and directing output to a work directory, use the syntax:
mktopics -outline /usr/u/sybase/topic_outlines/saint-bernard.otl -topicset /usr/u/sybase/topic_sets/saint-bernard_topic
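When there are several outline files, a small helper keeps the topic set directory names consistent. This Python sketch builds the mktopics argument list and derives the topic set directory from the outline file name by appending "_topic". That naming mirrors the example above but is only a convention chosen for illustration. The command is assembled, not executed.

```python
from pathlib import PurePosixPath


def mktopics_cmd(outline_file, topic_sets_root):
    """Arguments for compiling one outline file into a topic set directory."""
    outline = PurePosixPath(outline_file)
    topic_set = PurePosixPath(topic_sets_root) / ("%s_topic" % outline.stem)
    return ["mktopics", "-outline", str(outline), "-topicset", str(topic_set)]


print(" ".join(mktopics_cmd(
    "/usr/u/sybase/topic_outlines/saint-bernard.otl",
    "/usr/u/sybase/topic_sets",
)))
```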
A knowledge base map specifies the locations of one or more topic set directories. Create an ASCII knowledge base map file that defines the fully-qualified directory paths to your topic sets.
For example, the following knowledge base map file illustrates how you can list multiple knowledge bases in the map. The first entry identifies the topic set directory created with mktopics above.
$control:1
kbases:
{
kb:
/kb-path = /usr/u/sybase/topic_sets/saint-bernard_topic
kb:
/kb-path = /usr/u/sybase/topic_sets/another_topic
}
Set the knowledge_base configuration parameter to point to the location of the knowledge base map. For example:
sp_text_configure KRAZYKAT, 'knowledge_base', '/usr/u/sybase/topic_sets/sample_text_topics.kbm'
The knowledge_base configuration parameter is static, and you must restart the Full-Text Search engine for the definition to take effect.
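The knowledge base map shown earlier has a regular structure, so it can be generated from a list of topic set directories. The following Python sketch (render_kb_map is a hypothetical name) reproduces the layout of the sample map and rejects relative paths, since the map is meant to hold fully-qualified directory paths.

```python
def render_kb_map(topic_set_dirs):
    """Render a knowledge base map listing the given topic set directories."""
    lines = ["$control:1", "kbases:", "{"]
    for path in topic_set_dirs:
        if not path.startswith("/"):
            raise ValueError("path is not fully qualified: %r" % path)
        lines.append("kb:")
        lines.append("/kb-path = %s" % path)
    lines.append("}")
    return "\n".join(lines) + "\n"


print(render_kb_map([
    "/usr/u/sybase/topic_sets/saint-bernard_topic",
    "/usr/u/sybase/topic_sets/another_topic",
]))
```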
You can now execute queries using the defined topic instead of a complex query. For example, before you create the "saint-bernard" topic, you would have to use the following syntax:
...where i.index_any = "<accrue> ([80]Saint Bernard, [80]St. Bernard, working dogs, large dogs, European breeds)"
to find documents that:
Contain one or more of the following phrases: "Saint Bernard," "St. Bernard," "working dogs," "large dogs," and "European breeds"
Score higher when they contain the phrase "Saint Bernard" or "St. Bernard" than when they contain the phrase "working dogs," "large dogs," or "European breeds"
After you create the topic "saint-bernard", you can use this syntax:
...where i.index_any = "<topic>saint-bernard"
or:
...where i.index_any = "saint bernard"
If you enter a word in a query expression, the Full-Text Search engine tries to match it with a topic name. If you enter a phrase in a query expression, the Full-Text Search engine replaces spaces with hyphens (-), and then tries to match it with a topic name. For example, the Full-Text Search engine matches "saint bernard" with the topic "saint-bernard".
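The phrase-to-topic matching rule can be mirrored in application code, for example to predict which topic a phrase will resolve to. This Python sketch applies only the space-to-hyphen rule described above. How the engine treats letter case or repeated spaces is not stated here, so the helper leaves case unchanged and treats any run of whitespace as one separator, which is an assumption.

```python
def topic_name_for(expression):
    """Return the topic name a word or phrase would be matched against."""
    return "-".join(expression.split())


def topic_query(topic):
    """Return the explicit <topic> form of a query expression."""
    return "<topic>%s" % topic


print(topic_name_for("saint bernard"))
print(topic_query(topic_name_for("saint bernard")))
```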
See the sample_text_topics.sql file for examples of using topics in queries.
If the knowledge_base configuration parameter specifies a knowledge base map file that does not exist, the Full-Text Search engine will not be able to start a session with Verity, and the server will not start. If the map file exists but contains invalid entries, Verity issues warning messages at start-up time. To correct the path information, edit the <textserver>.cfg file in the $SYBASE directory and change the line beginning "knowledge_base=".
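Because a bad path is only reported at start-up, it can be worth checking the map before restarting the engine. The Python sketch below scans the text of a knowledge base map for "/kb-path" entries and reports the directories that do not exist. The function name is hypothetical, and the parsing assumes the simple "/kb-path = path" form used in the sample map.

```python
import os


def missing_topic_sets(kb_map_text, exists=os.path.isdir):
    """Return the /kb-path entries in kb_map_text that do not exist.

    exists is injectable so the check can be tested without touching disk.
    """
    missing = []
    for line in kb_map_text.splitlines():
        line = line.strip()
        if line.startswith("/kb-path"):
            path = line.partition("=")[2].strip()
            if not exists(path):
                missing.append(path)
    return missing


kb_map = (
    "$control:1\nkbases:\n{\n"
    "kb:\n/kb-path = /usr/u/sybase/topic_sets/saint-bernard_topic\n"
    "kb:\n/kb-path = /usr/u/sybase/topic_sets/another_topic\n}\n"
)
print(missing_topic_sets(kb_map))
```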