ServBay Chinese Full-Text Search: zhparser Usage Guide
`zhparser` is a powerful third-party PostgreSQL extension specifically designed for efficient handling of Chinese text. It equips PostgreSQL databases with precise Chinese word segmentation and full-text search capabilities, making it an ideal choice for building applications involving Chinese content search. ServBay, as a comprehensive local web development environment, comes with the `zhparser` extension pre-installed and supports integration with the `scws` (Simple Chinese Word Segmentation) library, allowing you to leverage custom dictionaries through `scws`.

This article provides a detailed walkthrough on how to install (enable) and configure `zhparser` within the ServBay environment, demonstrates its usage for Chinese full-text search, and explains how to create and apply custom dictionaries using ServBay's built-in `scws`.
1. Overview
For applications dealing with substantial amounts of Chinese text, such as content management systems, forums, and e-commerce platforms, implementing efficient and accurate full-text search is essential. Although PostgreSQL natively supports full-text search, its default behavior handles Chinese poorly, as it is primarily designed for space-delimited languages. The `zhparser` extension integrates Chinese word segmentation technology to solve this problem, allowing PostgreSQL to recognize word boundaries in Chinese text and enabling effective full-text search.
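As a quick illustration of the problem (a minimal sketch; any database without `zhparser` configured behaves this way), PostgreSQL's built-in `simple` configuration splits on whitespace and punctuation, so an unsegmented Chinese sentence collapses into a single unusable token:

```sql
-- Without Chinese segmentation, the whole sentence becomes one lexeme:
SELECT to_tsvector('simple', '我爱自然语言处理');
-- Result: '我爱自然语言处理':1  (no word boundaries, so keyword search cannot match)
```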
ServBay pre-integrates `zhparser`, sparing developers the hassle of manually compiling and installing the extension. You can quickly set up a local development environment with Chinese search capabilities out of the box.
2. Prerequisites
Before using `zhparser`, please ensure that:
- ServBay is successfully installed.
- The PostgreSQL package within ServBay is enabled and running. You can check and manage package status through the ServBay application interface.
3. Installing (Enabling) the zhparser Extension
ServBay has already placed the `zhparser` module file where PostgreSQL can access it. All you need to do is run an SQL command in your target database to enable it.
Connect to your PostgreSQL database:

Open a terminal and use the `psql` command-line tool to connect to the PostgreSQL database managed by ServBay. Replace `servbay-demo` with your actual database username, and `your_database_name` with the name of the database where `zhparser` should be enabled.

```bash
psql -U servbay-demo -d your_database_name
```

If connecting to the default database (usually the same as the username), you can omit the `-d` parameter.

Create the `zhparser` extension:

In the `psql` interactive interface, execute the following SQL command:

```sql
CREATE EXTENSION zhparser;
```

If the command executes successfully, there should be no error message. If you see a message stating the extension already exists, it has already been enabled.

Verify the `zhparser` installation:

You can check the list of installed extensions with:

```sql
\dx
```

In the output, you should find `zhparser` along with its version information.
4. Configuring zhparser
After enabling `zhparser`, you need to configure PostgreSQL's text search functionality to use `zhparser` for Chinese word segmentation. This mainly involves creating a Text Search Configuration.
Create a text search configuration:

A text search configuration defines how documents are converted to `tsvector` (for indexing) and how query strings are converted to `tsquery` (for searching). Let's create a configuration named `chinese` and designate `zhparser` as its `PARSER`.

```sql
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
```
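As an optional sanity check (standard PostgreSQL catalogs, nothing ServBay-specific), you can confirm the configuration now exists:

```sql
-- The new configuration should appear in the system catalog
SELECT cfgname FROM pg_ts_config WHERE cfgname = 'chinese';
-- In psql, the meta-command \dF also lists all text search configurations
```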
Add dictionary mappings:

The text search configuration also needs to map the token types produced by the parser (`zhparser`) to dictionaries for processing. `zhparser` labels tokens by part of speech, such as nouns (n), verbs (v), and adjectives (a). Here, we map tokens labeled as nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), and temporary idioms (l) to the `simple` dictionary. The `simple` dictionary does not alter tokens; it preserves the words segmented by `zhparser` as-is.

```sql
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR n,v,a,i,e,l WITH simple;
-- You can add or modify part-of-speech tags and dictionaries as needed.
```

Note: the part-of-speech tags supported by `zhparser` may differ slightly from standard NLP tag sets; the ones above are the most commonly used in `zhparser`.
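If you want to see the full tag inventory for yourself, PostgreSQL's standard introspection functions work with any parser, including `zhparser`:

```sql
-- List every token type (part-of-speech tag) the zhparser parser can emit
SELECT * FROM ts_token_type('zhparser');

-- Show how a sample sentence is tokenized, tag by tag
SELECT * FROM ts_parse('zhparser', '中文分词是文本处理的重要步骤');
```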
5. Using zhparser for Full-Text Search
Once configuration is complete, you can start using `zhparser` for Chinese full-text search. Here's a simple demonstration.
5.1 Create a Sample Table and Data
First, create a sample table for storing Chinese text and insert some data.

Create the table:

```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT
);
```

Insert sample data:

```sql
INSERT INTO documents (content) VALUES
    ('我爱自然语言处理'),               -- I love natural language processing
    ('中文分词是文本处理的重要步骤'),   -- Chinese word segmentation is an important step in text processing
    ('zhparser是一个很好的中文分词工具'), -- zhparser is a great Chinese word segmentation tool
    ('ServBay让本地开发更便捷');        -- ServBay makes local development more convenient
```
5.2 Create a Full-Text Search Index
To improve search performance, especially on large datasets, it is highly recommended to create an index on the column used for full-text search. For `tsvector` values, a GIN (Generalized Inverted Index) index is typically recommended, as it is very efficient for full-text search queries.
Create a GIN index:

We'll create a GIN index on the `content` column. The index expression uses `to_tsvector('chinese', content)` to convert the `content` field to `tsvector` format, specifying our previously created `chinese` text search configuration so that the index leverages `zhparser` for word segmentation.

```sql
CREATE INDEX idx_gin_content ON documents
    USING gin (to_tsvector('chinese', content));
```
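An alternative worth considering (a sketch assuming PostgreSQL 12 or newer; the column and index names are illustrative) is to store the `tsvector` in a generated column and index that, so queries don't recompute `to_tsvector` for every row they return:

```sql
-- Compute the segmented form once, at write time
ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('chinese', content)) STORED;

CREATE INDEX idx_gin_content_tsv ON documents USING gin (content_tsv);

-- Queries can then match the stored column directly:
-- SELECT * FROM documents WHERE content_tsv @@ to_tsquery('chinese', '中文 & 分词');
```

This works because the two-argument form of `to_tsvector` (with an explicit configuration) is immutable, which generated columns require.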
5.3 Execute Full-Text Search Queries
Now, you can use `to_tsquery` to convert keywords into query format and use the `@@` operator to perform match queries against the indexed `tsvector` expression.
Execute a search query:

For example, to search for documents containing both "中文" (Chinese) and "分词" (word segmentation):

```sql
SELECT id,
       content,
       to_tsvector('chinese', content) AS content_tsvector -- Optional: view the segmentation result
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');
```

This query will return the documents with `id` 2 and 3, because their `content` contains both "中文" and "分词". You can try different queries:
- Search for documents containing "ServBay":

  ```sql
  SELECT * FROM documents
  WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', 'ServBay');
  -- Will return the document with id 4
  ```

- Search for documents containing "自然语言处理" (natural language processing):

  ```sql
  SELECT * FROM documents
  WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '自然语言处理');
  -- Will return the document with id 1
  ```

  Note that `zhparser` may segment "自然语言处理" as a whole word or split it apart, depending on the segmentation mode and dictionary. Adding it to a custom dictionary improves the results.
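When result ordering matters, you can rank matches with PostgreSQL's standard `ts_rank` function (a minimal sketch; weighting and normalization options are omitted):

```sql
-- Return matches ordered by relevance to the query
SELECT id,
       content,
       ts_rank(to_tsvector('chinese', content),
               to_tsquery('chinese', '中文 & 分词')) AS rank
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词')
ORDER BY rank DESC;
```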
6. Creating Custom Dictionaries with ServBay’s Built-In scws
ServBay integrates the `scws` library, allowing `zhparser` to use scws dictionary files, including custom dictionaries, to improve segmentation accuracy, especially for domain-specific terms and neologisms.
6.1 Create a Custom Dictionary File
Create or edit a custom dictionary file:

ServBay recommends placing your custom `scws` dictionary files in `/Applications/ServBay/etc/scws/`. Create a file named `custom_dict.txt` (if it doesn't already exist):

```bash
# Create or edit the file in the terminal
nano /Applications/ServBay/etc/scws/custom_dict.txt
```

Add vocabulary to the file:

In `custom_dict.txt`, add one word per line that you want `zhparser` to recognize as a distinct term. For example:

```plaintext
自然语言处理
中文分词
ServBay
本地开发环境
```

Save and close the file.
6.2 Configure zhparser to Use the Custom Dictionary
You need to tell `zhparser` to use this custom dictionary file.
Set the `zhparser.dict_path` parameter:

In your PostgreSQL session, run the following command to set the path to `zhparser`'s dictionary:

```sql
SET zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt';
-- Ensure the path is correct and the PostgreSQL user has read permission.
```

Note: the `SET` command only affects the current database session. To make this setting take effect for all new connections, edit PostgreSQL's configuration file, `postgresql.conf`, adding or updating `zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt'`, and then restart the PostgreSQL service (via the ServBay application interface). For local development and testing, using `SET` is usually sufficient and convenient.
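A middle ground between per-session `SET` and editing `postgresql.conf` (a sketch; it assumes the parameter is settable at the database level, which is typically the case for extension parameters) is to persist the setting for a single database:

```sql
-- Applies to all new connections to this database; no restart required
ALTER DATABASE your_database_name
    SET zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt';
```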
6.3 Reload the Dictionary
After modifying the dictionary file, you need to instruct `zhparser` to reload the dictionary for the changes to take effect.

Call the reload function:

```sql
SELECT zhprs_reload_dict();
```

After executing this function, subsequent segmentation operations will use the updated dictionary, including your custom terms.
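A quick way to confirm the custom dictionary took effect (an illustrative check; the exact output depends on your dictionary and segmentation mode) is to segment one of the terms you added:

```sql
-- With the custom dictionary active, '自然语言处理' should come back
-- as a single lexeme rather than several fragments
SELECT to_tsvector('chinese', '自然语言处理');
```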
7. Adjusting Segmentation Modes
`zhparser` supports different segmentation modes that affect segmentation granularity. The most commonly used control parameter is `zhparser.seg_with_duality`.
7.1 Set Segmentation Mode
Set the `zhparser.seg_with_duality` parameter:

- Setting it to `true` enables "dual segmentation" mode for finer granularity, which improves recall (finds more relevant documents). For example, "自然语言处理" might be split into "自然", "语言", "处理", "自然语言", "语言处理".

  ```sql
  SET zhparser.seg_with_duality = true;
  ```

- Setting it to `false` uses coarser segmentation, generally matching the longest terms in the dictionary, which may improve precision. For example, "自然语言处理" will usually remain one word if it is in the dictionary.

  ```sql
  SET zhparser.seg_with_duality = false;
  ```

As with the dictionary path, `SET` only affects the current session; to persist the mode, add it to the `postgresql.conf` file and restart PostgreSQL.
8. Frequently Asked Questions (FAQ)
- Q: `CREATE EXTENSION zhparser;` reports that the extension cannot be found?

  A: Ensure that the PostgreSQL package in ServBay is correctly installed and running. ServBay should have placed the `zhparser` library file in PostgreSQL's extension directory. If the issue persists, check whether the ServBay and PostgreSQL installations are complete, or try restarting ServBay.

- Q: The custom dictionary is not taking effect?

  A: Please check the following:
  - Has the `zhparser.dict_path` parameter been correctly set to your custom dictionary file path (`/Applications/ServBay/etc/scws/custom_dict.txt`)? Remember the path is case sensitive.
  - Did you execute `SELECT zhprs_reload_dict();` after setting `zhparser.dict_path`, to reload the dictionary?
  - Is your custom dictionary file formatted correctly (one word per line)?
  - If you are testing in a new database session, did you run `SET zhparser.dict_path = ...;` again, or was it added to `postgresql.conf` with PostgreSQL restarted?
  - Does the PostgreSQL user have read permission for the dictionary file?

- Q: Full-text search results are not as expected?

  A: Check whether your text search configuration (`chinese`) correctly maps part-of-speech tags to dictionaries. Try adjusting the `zhparser.seg_with_duality` parameter and observe whether it affects the results. Use `SELECT to_tsvector('chinese', 'your Chinese text');` to inspect how a particular text is segmented, which helps with debugging. Also make sure your search query (`to_tsquery`) uses the right keywords and logical operators (`&`, `|`, `!`).

- Q: Full-text search performance is poor?

  A: Check that a GIN index has been created on the `to_tsvector(...)` expression, as shown in the check below. For very large datasets, consider further optimizing the PostgreSQL configuration or exploring advanced indexing techniques.
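To verify the index is actually being used (an illustrative check; on tiny tables the planner may choose a sequential scan regardless):

```sql
EXPLAIN ANALYZE
SELECT id FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');
-- On a sufficiently large table, the plan should show a
-- "Bitmap Index Scan" on idx_gin_content rather than a "Seq Scan".
```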
9. Conclusion
With ServBay, implementing Chinese full-text search in PostgreSQL using `zhparser` is remarkably straightforward. In just a few steps, you can enable the extension, configure text search, and leverage ServBay's built-in `scws` for custom dictionary support. Mastering the fundamentals of `zhparser` and its configuration will greatly enhance your local development environment's ability to process Chinese text data, providing a solid foundation for building feature-rich Chinese-language applications.