ServBay Chinese Full-Text Search: zhparser Usage Guide
zhparser is a powerful third-party PostgreSQL extension specifically designed for efficient handling of Chinese text. It equips PostgreSQL databases with precise Chinese word segmentation and full-text search capabilities, making it an ideal choice for building applications involving Chinese content search. ServBay, as a comprehensive local web development environment, comes with the zhparser extension pre-installed and supports integration with the scws (Simple Chinese Word Segmentation) library, allowing you to leverage custom dictionaries through scws.
This article provides a detailed walkthrough on how to install (enable) and configure zhparser within the ServBay environment, demonstrates its usage for Chinese full-text search, and explains how to create and apply custom dictionaries using ServBay’s built-in scws.
1. Overview
For applications dealing with substantial amounts of Chinese text, such as content management systems, forums, and e-commerce platforms, implementing efficient and accurate full-text search is essential. Although PostgreSQL natively supports full-text search, its default behavior handles Chinese poorly, as it is primarily designed for space-delimited languages. The zhparser extension integrates Chinese word segmentation technology to solve this problem, allowing PostgreSQL to recognize word boundaries in Chinese text and enabling effective full-text search.
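You can observe the default behavior directly: PostgreSQL's standard parser treats a contiguous run of Chinese characters as a single token, so without a Chinese-aware parser individual words inside a phrase are never indexed separately. A quick illustration using the stock `simple` configuration:

```sql
-- The whole phrase is indexed as one opaque lexeme, so a search for
-- a single word inside it (e.g. 检索) will not match.
SELECT to_tsvector('simple', '中文全文检索');
```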
ServBay pre-integrates zhparser, sparing developers from the hassle of manually compiling and installing the extension. You can quickly set up a local development environment with Chinese search capabilities out of the box.
2. Prerequisites
Before using zhparser, please ensure that:
- ServBay is successfully installed.
- The PostgreSQL package within ServBay is enabled and running. You can check and manage package status through the ServBay application interface.
3. Installing (Enabling) the zhparser Extension
ServBay has already placed the zhparser module file where PostgreSQL can access it. All you need to do is run an SQL command in your target database to enable it.
Connect to your PostgreSQL database:
Open a terminal and use the `psql` command-line tool to connect to the PostgreSQL database managed by ServBay. Replace `servbay-demo` with your actual database username, and `your_database_name` with the name of the database where `zhparser` should be enabled.

```bash
psql -U servbay-demo -d your_database_name
```

If connecting to the default database (usually the same as the username), you can omit the `-d` parameter.

Create the `zhparser` extension:
In the `psql` interactive interface, execute the following SQL command:

```sql
CREATE EXTENSION zhparser;
```

If the command executes successfully, there should be no error message. If you see a message stating the extension already exists, it means it has already been enabled.

Verify the `zhparser` installation:
You can check the list of installed extensions using the command:

```sql
\dx
```

In the output, you should find `zhparser` along with its version information.
4. Configuring zhparser
After enabling zhparser, you need to configure PostgreSQL’s text search functionality to use zhparser for Chinese word segmentation. This mainly involves creating a Text Search Configuration.
Create a text search configuration:
A text search configuration defines how documents are converted to `tsvector` (for indexing) and how query strings are converted to `tsquery` (for searching). Let's create a configuration named `chinese` and designate `zhparser` as its parser.

```sql
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
```

Add dictionary mappings:
The text search configuration also needs to map the token types produced by the parser (`zhparser`) to dictionaries for processing. `zhparser` labels tokens by part of speech, such as nouns (n), verbs (v), and adjectives (a). Here, we map tokens labeled as nouns (n), verbs (v), adjectives (a), idioms (i), interjections (e), and idiomatic phrases (l) to the `simple` dictionary. The `simple` dictionary does not alter tokens; it preserves words segmented by `zhparser` as-is.

```sql
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR n,v,a,i,e,l WITH simple;
-- You can add or modify part-of-speech tags and dictionaries as needed.
```

Note: The part-of-speech tags supported by `zhparser` may differ slightly from standard NLP tag sets; those listed above are the most commonly used `zhparser` tags.
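If you are unsure which token types the parser emits, PostgreSQL can list them for any installed parser via the built-in `ts_token_type` function:

```sql
-- List every token type zhparser can produce, with aliases and descriptions.
SELECT * FROM ts_token_type('zhparser');
```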
5. Using zhparser for Full-Text Search
Once configuration is complete, you can start using zhparser for Chinese full-text search. Here’s a simple demonstration.
5.1 Create a Sample Table and Data
First, create a sample table for storing Chinese text and insert some data.
Create the table:
```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT
);
```

Insert sample data:

```sql
INSERT INTO documents (content) VALUES
    ('我爱自然语言处理'),                 -- "I love natural language processing"
    ('中文分词是文本处理的重要步骤'),     -- "Chinese word segmentation is an important step in text processing"
    ('zhparser是一个很好的中文分词工具'), -- "zhparser is a great Chinese word segmentation tool"
    ('ServBay让本地开发更加便捷');        -- "ServBay makes local development more convenient"
```
5.2 Create a Full-Text Search Index
To improve search performance, especially on large datasets, it is highly recommended to create an index on the column used for full-text search. For columns of type tsvector, a GIN (Generalized Inverted Index) index is typically recommended as it is very efficient for full-text search queries.
Create a GIN index:
We'll create a GIN index on the `content` column. While creating the index, we use `to_tsvector('chinese', content)` to convert the `content` field to `tsvector` format, specifying our previously created `chinese` text search configuration so that the index leverages `zhparser` for word segmentation.

```sql
CREATE INDEX idx_gin_content ON documents
    USING gin (to_tsvector('chinese', content));
```
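On PostgreSQL 12 and later, an alternative sketch is to materialize the `tsvector` in a stored generated column and index that, so queries don't recompute `to_tsvector` per row; the column and index names below are illustrative:

```sql
-- Hypothetical alternative: a stored generated tsvector column (PostgreSQL 12+).
-- The two-argument to_tsvector(regconfig, text) is immutable, as generated
-- columns require.
ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('chinese', content)) STORED;

CREATE INDEX idx_gin_content_tsv ON documents USING gin (content_tsv);

-- Queries can then match against the column directly:
-- SELECT * FROM documents WHERE content_tsv @@ to_tsquery('chinese', '分词');
```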
5.3 Execute Full-Text Search Queries
Now, you can use to_tsquery to convert keywords into query format and use the @@ operator to perform match queries against the tsvector index column.
Execute a search query:
For example, to search for documents containing both "中文" (Chinese) and "分词" (word segmentation):

```sql
SELECT id,
       content,
       to_tsvector('chinese', content) AS content_tsvector -- Optional: view segmentation result
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');
```

This query will return the documents with `id` 2 and 3, because their `content` contains both "中文" and "分词". You can try different queries:

- Search for documents containing "ServBay" (returns the document with id 4):

  ```sql
  SELECT * FROM documents
  WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', 'ServBay');
  ```

- Search for documents containing "自然语言处理" (natural language processing); this returns the document with id 1. Note that `zhparser` may segment "自然语言处理" as a whole word or split it apart, depending on the segmentation mode and dictionary. Adding it to a custom dictionary improves results.

  ```sql
  SELECT * FROM documents
  WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '自然语言处理');
  ```
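When a query matches many rows, you will usually want to order them by relevance. PostgreSQL's built-in `ts_rank` function works unchanged with the `chinese` configuration; a minimal sketch:

```sql
-- Rank matches by relevance; '|' means OR in tsquery syntax.
SELECT id,
       content,
       ts_rank(to_tsvector('chinese', content),
               to_tsquery('chinese', '中文 | 分词')) AS rank
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 | 分词')
ORDER BY rank DESC;
```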
6. Creating Custom Dictionaries with ServBay’s Built-In scws
ServBay integrates the scws library, allowing zhparser to use scws dictionary files, including custom dictionaries, to improve segmentation accuracy, especially for domain-specific terms or neologisms.
6.1 Create a Custom Dictionary File
Create or edit a custom dictionary file:
ServBay recommends placing your custom `scws` dictionary files in `/Applications/ServBay/etc/scws/`. Create a file named `custom_dict.txt` (if it doesn't already exist).

```bash
# Create or edit the file in terminal
nano /Applications/ServBay/etc/scws/custom_dict.txt
```

Add vocabulary to the file:
In `custom_dict.txt`, add one word per line that you want `zhparser` to recognize as a distinct term. For example:

```plaintext
自然语言处理
中文分词
ServBay
本地开发环境
```

Save and close the file.
6.2 Configure zhparser to Use the Custom Dictionary
You need to tell zhparser to use this custom dictionary file.
Set the `zhparser.dict_path` parameter:
In your PostgreSQL session, run the following command to set the path for `zhparser`'s dictionary:

```sql
SET zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt';
-- Please ensure the path is correct and the PostgreSQL user has read permission.
```

Note: The `SET` command only affects the current database session. To make this setting take effect for all new connections, edit PostgreSQL's configuration file, `postgresql.conf`, add or update `zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt'`, and then restart the PostgreSQL service (via the ServBay application interface). For local development and testing, using `SET` is usually sufficient and convenient.
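If you want the setting to survive new sessions without editing `postgresql.conf`, one possible middle ground (assuming your user has the privilege to alter the database) is a per-database default, which applies to subsequent connections to that database:

```sql
-- Persist the parameter for one database; takes effect for new sessions.
ALTER DATABASE your_database_name
    SET zhparser.dict_path = '/Applications/ServBay/etc/scws/custom_dict.txt';
```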
6.3 Reload the Dictionary
After modifying the dictionary file, you need to instruct zhparser to reload the dictionary in order for the changes to take effect.
Call the reload function:

```sql
SELECT zhprs_reload_dict();
```

After executing this function, subsequent segmentation operations will use the updated dictionary, including your custom terms.
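To confirm the custom dictionary has taken effect, segment one of your custom terms and compare the output before and after the reload (the exact lexemes depend on your dictionary and segmentation mode):

```sql
-- After the reload, a term listed in custom_dict.txt should come back
-- as a single lexeme rather than several fragments.
SELECT to_tsvector('chinese', '自然语言处理');
```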
7. Adjusting Segmentation Modes
zhparser supports different segmentation modes that affect segmentation granularity. The most commonly used control parameter is zhparser.seg_with_duality.
7.1 Set Segmentation Mode
Set the `zhparser.seg_with_duality` parameter:

- Setting it to `true` enables "dual segmentation" mode for finer granularity, which improves recall (finds more relevant documents). For example, "自然语言处理" might be split into "自然", "语言", "处理", "自然语言", and "语言处理".

  ```sql
  SET zhparser.seg_with_duality = true;
  ```

- Setting it to `false` uses coarser segmentation, generally matching the longest terms in the dictionary, which may improve precision. For example, "自然语言处理" will usually be kept as one word if it is in the dictionary.

  ```sql
  SET zhparser.seg_with_duality = false;
  ```

As with `zhparser.dict_path`, this setting can be persisted for all connections in the `postgresql.conf` file.
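The effect of the parameter is easiest to see by segmenting the same phrase in both modes within one session:

```sql
SET zhparser.seg_with_duality = true;
SELECT to_tsvector('chinese', '自然语言处理');  -- finer-grained tokens

SET zhparser.seg_with_duality = false;
SELECT to_tsvector('chinese', '自然语言处理');  -- coarser, longest-match tokens
```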
8. Frequently Asked Questions (FAQ)
- Q: `CREATE EXTENSION zhparser;` reports that the extension cannot be found?
  A: Ensure that the PostgreSQL package in ServBay is correctly installed and running. ServBay should have placed the `zhparser` library file in PostgreSQL's extension directory. If the issue persists, check whether the ServBay and PostgreSQL installations are complete, or try restarting ServBay.
- Q: The custom dictionary is not taking effect?
  A: Please check the following:
  - Has the `zhparser.dict_path` parameter been correctly set to your custom dictionary file path (`/Applications/ServBay/etc/scws/custom_dict.txt`)? Remember that the path is case sensitive.
  - Did you execute `SELECT zhprs_reload_dict();` after setting `zhparser.dict_path` to reload the dictionary?
  - Is your custom dictionary file formatted correctly (one word per line)?
  - If you are testing in a new database session, did you run `SET zhparser.dict_path = ...;` again, or was the setting added to `postgresql.conf` and PostgreSQL restarted?
  - Does the PostgreSQL user have read permission for the dictionary file?
- Q: Full-text search results are not as expected?
  A: Check whether your text search configuration (`chinese`) correctly maps part-of-speech tags to dictionaries. Try adjusting the `zhparser.seg_with_duality` parameter and observe whether this affects the results. Use `SELECT to_tsvector('chinese', 'your Chinese text');` to inspect how a particular text is segmented, which helps with debugging. Also, make sure your search query (`to_tsquery`) uses the right keywords and logical operators (`&`, `|`, `!`).
- Q: Full-text search performance is poor?
  A: Check that a GIN index has been created on the `to_tsvector(...)` expression. For very large datasets, consider further tuning the PostgreSQL configuration or exploring more advanced indexing techniques.
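For step-by-step debugging, PostgreSQL's built-in `ts_debug` function shows each token the parser produced, its token type, and which dictionary (if any) handled it under a given configuration:

```sql
-- Inspect how the 'chinese' configuration tokenizes and maps a sample text.
SELECT * FROM ts_debug('chinese', 'ServBay支持中文分词');
```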
9. Conclusion
With ServBay, implementing Chinese full-text search in PostgreSQL using zhparser is extremely straightforward. With just a few simple steps, you can enable the extension, configure text search, and leverage ServBay's built-in scws for custom dictionary support. Mastering the fundamentals of zhparser and its configuration will greatly enhance your local development environment’s capability to process Chinese text data, providing a solid foundation for building feature-rich Chinese language applications.
