ServBay Chinese Full-Text Search: zhparser Usage Guide
`zhparser` is a powerful third-party PostgreSQL extension specifically designed for efficient handling of Chinese text. It equips PostgreSQL databases with precise Chinese word segmentation and full-text search capabilities, making it an ideal choice for building applications involving Chinese content search. ServBay, as a comprehensive local web development environment, comes with the `zhparser` extension pre-installed and supports integration with the `scws` (Simple Chinese Word Segmentation) library, allowing you to leverage custom dictionaries through `scws`.

This article provides a detailed walkthrough on how to install (enable) and configure `zhparser` within the ServBay environment, demonstrates its usage for Chinese full-text search, and explains how to create and apply custom dictionaries using ServBay's built-in `scws`.
1. Overview
For applications dealing with substantial amounts of Chinese text, such as content management systems, forums, and e-commerce platforms, implementing efficient and accurate full-text search is essential. Although PostgreSQL natively supports full-text search, its default behavior handles Chinese poorly, as it is primarily designed for space-delimited languages. The `zhparser` extension integrates Chinese word segmentation technology to solve this problem, allowing PostgreSQL to recognize word boundaries in Chinese text and enabling effective full-text search.
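As a quick illustration of the problem (a minimal sketch; any database without `zhparser` configured behaves this way), PostgreSQL's built-in `simple` configuration splits on whitespace and punctuation, so an unsegmented Chinese sentence collapses into a single unusable token:

```sql
-- Without Chinese segmentation, the whole sentence becomes one lexeme:
SELECT to_tsvector('simple', '我爱自然语言处理');
-- Result: '我爱自然语言处理':1  (no word boundaries, so keyword search cannot match)
```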
ServBay pre-integrates `zhparser`, sparing developers the hassle of manually compiling and installing the extension. You can quickly set up a local development environment with Chinese search capabilities out of the box.
2. Prerequisites
Before using `zhparser`, please ensure that:
- ServBay is successfully installed.
- The PostgreSQL package within ServBay is enabled and running. You can check and manage package status through the ServBay application interface.
3. Installing (Enabling) the zhparser Extension
ServBay has already placed the `zhparser` module file where PostgreSQL can access it. All you need to do is run an SQL command in your target database to enable it.
Connect to your PostgreSQL database:

Open a terminal and use the `psql` command-line tool to connect to the PostgreSQL database managed by ServBay. Replace `servbay-demo` with your actual database username, and `your_database_name` with the name of the database where `zhparser` should be enabled.

```bash
psql -U servbay-demo -d your_database_name
```

If connecting to the default database (usually the same as the username), you can omit the `-d` parameter.

Create the `zhparser` extension:

In the `psql` interactive interface, execute the following SQL command:

```sql
CREATE EXTENSION zhparser;
```

If the command executes successfully, there should be no error message. If you see a message stating the extension already exists, it has already been enabled.

Verify the `zhparser` installation:

You can check the list of installed extensions with:

```sql
\dx
```

In the output, you should find `zhparser` along with its version information.
4. Configuring zhparser
After enabling `zhparser`, you need to configure PostgreSQL's text search functionality to use `zhparser` for Chinese word segmentation. This mainly involves creating a Text Search Configuration.
Create a text search configuration:

A text search configuration defines how documents are converted to `tsvector` (for indexing) and how query strings are converted to `tsquery` (for searching). Let's create a configuration named `chinese` and designate `zhparser` as its `PARSER`.

```sql
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
```
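As an optional sanity check (standard PostgreSQL catalogs, nothing ServBay-specific), you can confirm the configuration now exists:

```sql
-- The new configuration should appear in the system catalog
SELECT cfgname FROM pg_ts_config WHERE cfgname = 'chinese';
-- In psql, the meta-command \dF also lists all text search configurations
```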
Add dictionary mappings:

The text search configuration also needs to map the token types produced by the parser (`zhparser`) to dictionaries for processing. `zhparser` labels tokens by part of speech, such as nouns (n), verbs (v), and adjectives (a). Here, we map tokens labeled as nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), and temporary idioms (l) to the `simple` dictionary. The `simple` dictionary does not alter tokens; it preserves the words segmented by `zhparser` as-is.

```sql
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR n,v,a,i,e,l WITH simple;
-- You can add or modify part-of-speech tags and dictionaries as needed.
```

Note: the part-of-speech tags supported by `zhparser` may differ slightly from standard NLP tag sets; the ones above are the most commonly used in `zhparser`.
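If you want to see the full tag inventory for yourself, PostgreSQL's standard introspection functions work with any parser, including `zhparser`:

```sql
-- List every token type (part-of-speech tag) the zhparser parser can emit
SELECT * FROM ts_token_type('zhparser');

-- Show how a sample sentence is tokenized, tag by tag
SELECT * FROM ts_parse('zhparser', '中文分词是文本处理的重要步骤');
```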
5. Using zhparser for Full-Text Search
Once configuration is complete, you can start using `zhparser` for Chinese full-text search. Here's a simple demonstration.
5.1 Create a Sample Table and Data
First, create a sample table for storing Chinese text and insert some data.

Create the table:

```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT
);
```

Insert sample data:

```sql
INSERT INTO documents (content) VALUES
    ('我爱自然语言处理'),               -- I love natural language processing
    ('中文分词是文本处理的重要步骤'),   -- Chinese word segmentation is an important step in text processing
    ('zhparser是一个很好的中文分词工具'), -- zhparser is a great Chinese word segmentation tool
    ('ServBay让本地开发更便捷');        -- ServBay makes local development more convenient
```
5.2 Create a Full-Text Search Index
To improve search performance, especially on large datasets, it is highly recommended to create an index on the column used for full-text search. For `tsvector` values, a GIN (Generalized Inverted Index) index is typically recommended, as it is very efficient for full-text search queries.
Create a GIN index:

We'll create a GIN index on the `content` column. The index expression uses `to_tsvector('chinese', content)` to convert the `content` field to `tsvector` format, specifying our previously created `chinese` text search configuration so that the index leverages `zhparser` for word segmentation.

```sql
CREATE INDEX idx_gin_content ON documents
    USING gin (to_tsvector('chinese', content));
```
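An alternative worth considering (a sketch assuming PostgreSQL 12 or newer; the column and index names are illustrative) is to store the `tsvector` in a generated column and index that, so queries don't recompute `to_tsvector` for every row they return:

```sql
-- Compute the segmented form once, at write time
ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('chinese', content)) STORED;

CREATE INDEX idx_gin_content_tsv ON documents USING gin (content_tsv);

-- Queries can then match the stored column directly:
-- SELECT * FROM documents WHERE content_tsv @@ to_tsquery('chinese', '中文 & 分词');
```

This works because the two-argument form of `to_tsvector` (with an explicit configuration) is immutable, which generated columns require.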
5.3 Execute Full-Text Search Queries
Now, you can use `to_tsquery` to convert keywords into query format and use the `@@` operator to perform match queries against the indexed `tsvector` expression.
Execute a search query:

For example, to search for documents containing both "中文" (Chinese) and "分词" (word segmentation):

```sql
SELECT id,
       content,
       to_tsvector('chinese', content) AS content_tsvector -- Optional: view the segmentation result
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');
```

This query will return the documents with `id` 2 and 3, because their `content` contains both "中文" and "分词". You can try different queries:
- Search for documents containing "ServBay":

  ```sql
  SELECT * FROM documents
  WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', 'ServBay');
  -- Will return the document with id 4
  ```

- Search for documents containing "自然语言处理" (natural language processing):

  ```sql
  SELECT * FROM documents
  WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '自然语言处理');
  -- Will return the document with id 1
  ```

  Note that `zhparser` may segment "自然语言处理" as a whole word or split it apart, depending on the segmentation mode and dictionary. Adding it to a custom dictionary improves the results.
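When result ordering matters, you can rank matches with PostgreSQL's standard `ts_rank` function (a minimal sketch; weighting and normalization options are omitted):

```sql
-- Return matches ordered by relevance to the query
SELECT id,
       content,
       ts_rank(to_tsvector('chinese', content),
               to_tsquery('chinese', '中文 & 分词')) AS rank
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词')
ORDER BY rank DESC;
```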
6. Creating Custom Dictionaries with ServBay’s Built-In scws
ServBay integrates the `scws` library, allowing `zhparser` to use scws dictionary files, including custom dictionaries, to improve segmentation accuracy, especially for domain-specific terms and neologisms.
6.1 Create a Custom Dictionary File
Create or edit a custom dictionary file:

ServBay recommends placing your custom `scws` dictionary files in `/Applications/ServBay/etc/scws/`. Create a file named `custom_dict.txt` (if it doesn't already exist):

```bash
# Create or edit the file in the terminal
nano /Applications/ServBay/etc/scws/custom_dict.txt
```

Add vocabulary to the file:

In `custom_dict.txt`, add one word per line that you want `zhparser` to recognize as a distinct term. For example:

```plaintext
自然语言处理
中文分词
ServBay
本地开发环境
```

Save and close the file.
6.2 Configure zhparser to Use the Custom Dictionary
You need to tell `zhparser` to use this custom dictionary file.
Set the `zhparser.dict_path` parameter:

In your PostgreSQL session, run the following command to set the path to `zhparser`'s dictionary:

```sql
SET zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt';
-- Ensure the path is correct and the PostgreSQL user has read permission.
```

Note: the `SET` command only affects the current database session. To make this setting take effect for all new connections, edit PostgreSQL's configuration file, `postgresql.conf`, adding or updating `zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt'`, and then restart the PostgreSQL service (via the ServBay application interface). For local development and testing, using `SET` is usually sufficient and convenient.
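A middle ground between per-session `SET` and editing `postgresql.conf` (a sketch; it assumes the parameter is settable at the database level, which is typically the case for extension parameters) is to persist the setting for a single database:

```sql
-- Applies to all new connections to this database; no restart required
ALTER DATABASE your_database_name
    SET zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt';
```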
6.3 Reload the Dictionary
After modifying the dictionary file, you need to instruct `zhparser` to reload the dictionary for the changes to take effect.

Call the reload function:

```sql
SELECT zhprs_reload_dict();
```

After executing this function, subsequent segmentation operations will use the updated dictionary, including your custom terms.
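A quick way to confirm the custom dictionary took effect (an illustrative check; the exact output depends on your dictionary and segmentation mode) is to segment one of the terms you added:

```sql
-- With the custom dictionary active, '自然语言处理' should come back
-- as a single lexeme rather than several fragments
SELECT to_tsvector('chinese', '自然语言处理');
```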
7. Adjusting Segmentation Modes
`zhparser` supports different segmentation modes that affect segmentation granularity. The most commonly used control parameter is `zhparser.seg_with_duality`.
7.1 Set Segmentation Mode
Set the `zhparser.seg_with_duality` parameter:

- Setting it to `true` enables "dual segmentation" mode for finer granularity, which improves recall (finds more relevant documents). For example, "自然语言处理" might be split into "自然", "语言", "处理", "自然语言", "语言处理".

  ```sql
  SET zhparser.seg_with_duality = true;
  ```

- Setting it to `false` uses coarser segmentation, generally matching the longest terms in the dictionary, which may improve precision. For example, "自然语言处理" will usually remain one word if it is in the dictionary.

  ```sql
  SET zhparser.seg_with_duality = false;
  ```

As with the dictionary path, `SET` only affects the current session; to persist the mode, add it to the `postgresql.conf` file and restart PostgreSQL.
8. Frequently Asked Questions (FAQ)
- Q: `CREATE EXTENSION zhparser;` reports that the extension cannot be found?

  A: Ensure that the PostgreSQL package in ServBay is correctly installed and running. ServBay should have placed the `zhparser` library file in PostgreSQL's extension directory. If the issue persists, check whether the ServBay and PostgreSQL installations are complete, or try restarting ServBay.

- Q: The custom dictionary is not taking effect?

  A: Please check the following:
  - Has the `zhparser.dict_path` parameter been correctly set to your custom dictionary file path (`/Applications/ServBay/etc/scws/custom_dict.txt`)? Remember the path is case sensitive.
  - Did you execute `SELECT zhprs_reload_dict();` after setting `zhparser.dict_path`, to reload the dictionary?
  - Is your custom dictionary file formatted correctly (one word per line)?
  - If you are testing in a new database session, did you run `SET zhparser.dict_path = ...;` again, or was it added to `postgresql.conf` with PostgreSQL restarted?
  - Does the PostgreSQL user have read permission for the dictionary file?

- Q: Full-text search results are not as expected?

  A: Check whether your text search configuration (`chinese`) correctly maps part-of-speech tags to dictionaries. Try adjusting the `zhparser.seg_with_duality` parameter and observe whether it affects the results. Use `SELECT to_tsvector('chinese', 'your Chinese text');` to inspect how a particular text is segmented, which helps with debugging. Also make sure your search query (`to_tsquery`) uses the right keywords and logical operators (`&`, `|`, `!`).

- Q: Full-text search performance is poor?

  A: Check that a GIN index has been created on the `to_tsvector(...)` expression, as shown in the check below. For very large datasets, consider further optimizing the PostgreSQL configuration or exploring advanced indexing techniques.
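To verify the index is actually being used (an illustrative check; on tiny tables the planner may choose a sequential scan regardless):

```sql
EXPLAIN ANALYZE
SELECT id FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');
-- On a sufficiently large table, the plan should show a
-- "Bitmap Index Scan" on idx_gin_content rather than a "Seq Scan".
```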
9. Conclusion
With ServBay, implementing Chinese full-text search in PostgreSQL using `zhparser` is remarkably straightforward. In just a few steps, you can enable the extension, configure text search, and leverage ServBay's built-in `scws` for custom dictionary support. Mastering the fundamentals of `zhparser` and its configuration will greatly enhance your local development environment's ability to process Chinese text data, providing a solid foundation for building feature-rich Chinese-language applications.