Implementing Chinese Full-Text Search with pg_jieba in ServBay
Overview
For languages such as English, PostgreSQL’s built-in full-text search works effectively due to lexical analysis based on spaces and punctuation. However, since Chinese text does not use spaces to separate words, a dedicated segmentation tool is required to break continuous sequences of Chinese characters into words with independent meanings.
pg_jieba is a third-party extension for PostgreSQL that integrates the popular Jieba Chinese segmentation library. With pg_jieba, you can efficiently and accurately segment Chinese text within PostgreSQL, enabling robust Chinese full-text search capabilities.
ServBay, as an integrated local web development environment, comes with the pg_jieba extension pre-installed, eliminating the need for manual compilation and setup. This allows you to develop and test Chinese full-text search functionality quickly in your local environment.
This guide walks you through enabling, configuring, and using the pg_jieba extension in ServBay.
Prerequisites
Before using pg_jieba, ensure you have completed the following:
- ServBay is installed on your macOS system, and the PostgreSQL database is running correctly.
- You are familiar with basic PostgreSQL operations, including connecting to databases and executing SQL statements.
Installing and Enabling pg_jieba
ServBay packages the pg_jieba extension along with PostgreSQL, so you don't need to download or compile anything manually. You just need to enable it in your target database with a simple SQL command.
Here's how to enable the pg_jieba extension:
Connect to your PostgreSQL database: Open the Terminal and use the psql command-line tool to connect to your PostgreSQL database. Replace your_username with your PostgreSQL username and your_database with your database name. By default, ServBay uses servbay or postgres as the username and database.

    psql -U your_username -d your_database

For example, using the default user and database:

    psql -U servbay -d servbay

Create and enable the pg_jieba extension: In the psql command line, execute:

    CREATE EXTENSION pg_jieba;

If the extension is already enabled in this database, running the command again raises an "already exists" error. This is harmless.

Verify that pg_jieba is enabled: List the installed extensions with:

    \dx

If pg_jieba appears in the list, the extension is enabled successfully.
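The enable-and-verify steps above can also be run as plain SQL, which is convenient in setup scripts. The following is a minimal sketch that uses only standard PostgreSQL features (the IF NOT EXISTS clause and the pg_available_extensions view); it assumes the extension is registered under the name pg_jieba, as this guide states.

```sql
-- Enable the extension only if it is not already present in this database,
-- so the statement can be re-run without raising an error.
CREATE EXTENSION IF NOT EXISTS pg_jieba;

-- Check that the server can see the extension files and whether it is installed.
-- installed_version is NULL when the extension is available but not yet enabled.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_jieba';
```

If the second query returns no rows, the server cannot find the extension files at all, which usually points to a different PostgreSQL instance than the one ServBay provides.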
Configuring pg_jieba for Chinese Full-Text Search
Once pg_jieba is enabled, you need to configure PostgreSQL's text search features to use pg_jieba for segmentation.
Setting up Text Search Configuration
A text search configuration defines how documents are processed for full-text search, including which parser to use and how to handle various token types.
Create a new text search configuration: Create a configuration named chinese that uses pg_jieba as its parser.

    CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = pg_jieba);

This configuration tells PostgreSQL to use pg_jieba for tokenizing text.

Add mappings for segmentation results: The pg_jieba parser generates various token types based on parts of speech. To enable indexing and searching for these tokens, map them to a specific dictionary. Here, we map common tags (such as noun n, verb v, adjective a, etc.) to PostgreSQL's built-in simple dictionary, which essentially uses the tokens from the parser as-is without further processing.

    ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l WITH simple;

The tags n,v,a,i,e,l cover some token types that pg_jieba may recognize. You can add or modify these as needed. Common tags include:
- n: Noun
- v: Verb
- a: Adjective
- i: Idiom
- e: Interjection
- l: Colloquialism
- nr: Person name
- ns: Place name
- nt: Organization
- nz: Other proper noun
- m: Numeral
- q: Quantifier
- t: Time word
- s: Place word
- f: Direction word
- p: Preposition
- c: Conjunction
- u: Auxiliary word
- xc: Other function word
- w: Punctuation
- eng: English
- x: Non-morpheme character
Typically, you’ll want to index and search for meaningful words like nouns, verbs, and adjectives.
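Because PostgreSQL silently discards any token whose type has no mapping, it helps to check what the parser actually emits before settling on a mapping list. The sketch below uses the standard pg_ts_parser catalog and the ts_token_type function. It assumes the parser is registered as pg_jieba, as in the statement above; some builds may register it under a different name (for example jieba), in which case the first query shows the name to use. The extra mapping for eng, nr, ns, nt and nz is an illustrative choice, not a ServBay recommendation.

```sql
-- List the text search parsers registered in this database.
SELECT prsname FROM pg_ts_parser;

-- List the token types the parser can emit.
-- Replace pg_jieba with the parser name returned by the query above if it differs.
SELECT tokid, alias, description
FROM ts_token_type('pg_jieba');

-- Illustrative extension of the mapping: also index English words and proper nouns.
-- Without a mapping for eng, a term tagged as English (such as a product name)
-- is dropped from both the tsvector and the tsquery.
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR eng, nr, ns, nt, nz WITH simple;
```

Only tag names that appear in the ts_token_type output can be used in ADD MAPPING; an unknown tag raises an error.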
Example: Full-Text Search Using pg_jieba
After configuration, you can perform Chinese full-text search using pg_jieba. Here's a simple example:
Creating a Sample Table and Data
First, create a table to store documents and insert some sample Chinese texts.
Create the table:

    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        content TEXT
    );

Insert sample data:

    INSERT INTO documents (content) VALUES
        ('我爱自然语言处理技术'),
        ('中文分词是文本处理的重要步骤'),
        ('pg_jieba是一个很好的中文分词工具,它基于结巴分词库'),
        ('ServBay 让本地开发变得简单高效');

Creating a Full-Text Search Index
To keep searches fast, especially on large datasets, create an index on the search column. PostgreSQL's GIN (Generalized Inverted Index) index type is well suited to full-text search.
Create a GIN index: Use the to_tsvector function with your previously created chinese configuration to create a GIN index on the content column. to_tsvector('chinese', content) converts the text in content into a tsvector, PostgreSQL's internal representation for full-text search, using the chinese configuration (i.e. the pg_jieba parser).

    CREATE INDEX idx_gin_content ON documents USING gin (to_tsvector('chinese', content));

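An expression index like the one above is only used when a query repeats exactly the same expression, including the configuration name. A common alternative is to store the tsvector in a generated column and index that column. The following is a sketch under stated assumptions: it requires PostgreSQL 12 or later, and the names content_tsv and idx_gin_content_tsv are hypothetical.

```sql
-- Store the segmented text in a generated column (PostgreSQL 12+).
-- content_tsv is a hypothetical column name chosen for this example.
ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('chinese', content)) STORED;

-- Index the stored column instead of the expression.
CREATE INDEX idx_gin_content_tsv ON documents USING gin (content_tsv);

-- Queries can then reference the column directly.
SELECT id, content
FROM documents
WHERE content_tsv @@ to_tsquery('chinese', '中文 & 分词');
```

With either approach, the stored values reflect the configuration and dictionary at the time each row was written, so rows indexed before a mapping or dictionary change keep their old segmentation until they are rewritten or the index is rebuilt.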
Performing Full-Text Search Queries
Now you can use the to_tsquery function with the @@ operator to run full-text searches. to_tsquery('chinese', 'your query') converts your search terms into a tsquery type using the chinese configuration. The @@ operator checks whether a tsvector matches a tsquery.
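Before running searches, it can be useful to look at both sides of the @@ operator on their own. The sketch below relies only on standard PostgreSQL functions and the chinese configuration created earlier; the exact lexemes returned depend on how pg_jieba segments the text and on which token types are mapped.

```sql
-- Show the lexemes and positions that would be stored for a piece of text.
SELECT to_tsvector('chinese', '中文分词是文本处理的重要步骤');

-- Show how an operator-based query string is parsed with the same configuration.
SELECT to_tsquery('chinese', '中文 & 分词');

-- plainto_tsquery accepts free text without operators: it segments the input
-- and joins the resulting lexemes with AND, which suits raw user input.
SELECT plainto_tsquery('chinese', '中文分词');
```

to_tsquery raises a syntax error on input that is not a valid query expression, so plainto_tsquery is usually the safer choice for text typed by end users.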
Run a search query: Find documents containing both “中文” and “分词”:

    SELECT id, content
    FROM documents
    WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');

The & symbol stands for logical AND in tsquery. You may use | for OR and ! for NOT.

For instance, to find documents containing either “ServBay” or “开发”:

    SELECT id, content
    FROM documents
    WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', 'ServBay | 开发');

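The queries above return matches in no particular order. A minimal sketch of relevance ordering with PostgreSQL's built-in ts_rank function follows, combined with the ! operator. It assumes the documents table and chinese configuration from this guide; the search terms are only an illustration.

```sql
-- Rank documents that mention 分词 but not 技术.
-- The query is built once in the FROM clause and reused for matching and ranking.
SELECT id,
       content,
       ts_rank(to_tsvector('chinese', content), query) AS rank
FROM documents,
     to_tsquery('chinese', '分词 & !技术') AS query
WHERE to_tsvector('chinese', content) @@ query
ORDER BY rank DESC;
```

Keeping the WHERE expression identical to the indexed expression lets the planner use idx_gin_content for matching, while ts_rank is computed only for the rows that match.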
Custom Dictionaries
pg_jieba uses Jieba's default dictionary for segmentation. In certain scenarios, you may need to add specialized terms (such as technical jargon, product names, etc.) to improve segmentation accuracy.
You can create a custom dictionary file and configure pg_jieba to use it.
Adding Custom Terms
Create a custom dictionary file: In ServBay's configuration directory, create a text file such as:

    /Applications/ServBay/etc/pg_jieba/custom_dict.txt

Note: This is a suggested path; you can choose the most suitable location based on your ServBay installation and preferences.

Add terms to the custom dictionary: Open custom_dict.txt with a text editor. Add one custom word per line. Optionally, specify the word frequency and part-of-speech tag separated by spaces. The format is word [frequency [tag]]. The higher the frequency, the more likely it is to be recognized as a word.

    自然语言处理 3 n
    ServBay 5 eng
    结巴分词库 3 n

Here, 3 n means “自然语言处理” has a frequency of 3 and is a noun. 5 eng means “ServBay” has a frequency of 5 and is tagged as English (eng).

Configure pg_jieba to use the custom dictionary: In your PostgreSQL session, set the pg_jieba.dict_path parameter to point to the directory containing your custom dictionary. Note: pg_jieba.dict_path typically refers to the directory of the dictionary, not a single file. If your custom dictionary shares the directory with the main dictionary or is in the directory specified for pg_jieba dictionaries, you might not need to modify this parameter, or ServBay's packaging of pg_jieba might have its own requirements. Refer to ServBay's documentation or experiment to confirm the correct dict_path setting.

If ServBay's pg_jieba configuration allows specifying the custom dictionary file directly, or if your custom file is placed in a directory that pg_jieba scans by default, the following SET command may vary or be unnecessary. The following is based on standard documentation; please confirm with your actual ServBay setup:

    -- Assuming ServBay places the main dictionary here and custom_dict.txt is in this folder
    SET pg_jieba.dict_path = '/Applications/ServBay/etc/pg_jieba/';

Or, if pg_jieba.dict_path can point directly to a file (this is non-standard but presented for completeness):

    -- Use with caution; verify with your ServBay configuration
    SET pg_jieba.dict_path = '/Applications/ServBay/etc/pg_jieba/custom_dict.txt';

Important: The SET command is effective only for the current database session. To make this permanent, modify PostgreSQL's postgresql.conf to set the pg_jieba.dict_path parameter.
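To make the setting survive beyond one session, it helps to know which configuration file the server reads and what value is currently in effect. The sketch below uses standard PostgreSQL commands. It assumes the parameter name pg_jieba.dict_path and the directory used earlier in this guide, and the default servbay database; whether ServBay's build of pg_jieba honors a per-database setting is an assumption to verify on your installation.

```sql
-- Locate the postgresql.conf file used by the running server
-- (may require superuser or pg_read_all_settings privileges).
SHOW config_file;

-- Read the value currently in effect; the second argument makes the function
-- return NULL instead of raising an error when the parameter is not set.
SELECT current_setting('pg_jieba.dict_path', true);

-- Alternative to editing postgresql.conf: store a default for one database.
-- It applies to sessions opened after the statement runs and may require superuser rights.
ALTER DATABASE servbay
    SET pg_jieba.dict_path = '/Applications/ServBay/etc/pg_jieba/';
```

A per-database default keeps the change scoped to the database that needs Chinese search, while an entry in postgresql.conf applies to the whole server.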
Reloading the Dictionary
After modifying the custom dictionary or changing the pg_jieba.dict_path configuration, instruct pg_jieba to reload the dictionary for changes to take effect.
Reload the dictionary: Execute this SQL function:

    SELECT jieba_reload_dict();

Once executed successfully, all subsequent segmentation will use the updated dictionary.
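A quick way to confirm that a reload took effect is to segment a sentence containing one of the custom terms and look at the tokens. This is a sketch that assumes the jieba_reload_dict function named above, the chinese configuration, and the sample dictionary entry 自然语言处理 from this guide.

```sql
-- Reload the dictionaries, then segment a sentence that contains a custom term.
SELECT jieba_reload_dict();

-- ts_debug is a standard PostgreSQL function; each row is one token from the parser.
-- If the custom entry was loaded, 自然语言处理 should be reported as a single token
-- rather than being split into shorter words.
SELECT alias, token, lexemes
FROM ts_debug('chinese', '我爱自然语言处理技术');
```

The alias column also shows which part-of-speech tag the token received, which tells you whether the tag given in the dictionary file was applied.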
Frequently Asked Questions (FAQ)
Q: What should I do if I get the error "extension 'pg_jieba' is not available" when executing CREATE EXTENSION pg_jieba;?
A: This usually means the pg_jieba extension files are not correctly installed in PostgreSQL's extension directory or they can't be found. In ServBay, pg_jieba should be pre-installed. Ensure you are connecting to the PostgreSQL instance provided by ServBay and that your ServBay installation is not corrupted. If the problem persists, try restarting ServBay or checking the ServBay log files.
Q: What if my custom dictionary doesn't take effect?
A: Please check the following:
- Ensure the custom dictionary file path is correct and the PostgreSQL user has read permissions for the file.
- Verify the file format: one term per line, optional frequency and part-of-speech tag separated by spaces.
- Make sure you have set the pg_jieba.dict_path parameter correctly. Note that the SET command applies only to the current session; for other sessions or to persist after restarts, you must update postgresql.conf.
- Confirm you have executed SELECT jieba_reload_dict(); to reload the dictionary.
- If you updated postgresql.conf, be sure to restart the PostgreSQL service.
Q: What should I do if the full-text search results are inaccurate?
A: Full-text search accuracy depends on the segmentation results and query formulation.
- Inspect the segmentation: Use the ts_debug('chinese', 'your text') function to see how your text is tokenized using the chinese configuration. This can help you assess whether pg_jieba and your custom dictionary are working as intended.
- Optimize segmentation configuration: Tune the ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR ... WITH simple; statement to include or exclude parts of speech (like filtering out function words, punctuation, etc.).
- Refine your search query: Check whether the keywords and logical operators (&, |, !) in your to_tsquery accurately reflect your search intent.
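The first two checks can be combined into a short diagnostic session. The sketch below assumes the chinese configuration, the idx_gin_content index, and the n,v,a,i,e,l mapping created earlier in this guide; dropping the mapping for e (interjections) is only an illustration of how to exclude a part of speech.

```sql
-- Find tokens that are being discarded: with the simple dictionary,
-- a NULL lexemes value means the token type has no mapping.
SELECT alias, token
FROM ts_debug('chinese', 'pg_jieba是一个很好的中文分词工具,它基于结巴分词库')
WHERE lexemes IS NULL;

-- Stop indexing one part of speech (here interjections, as an example).
ALTER TEXT SEARCH CONFIGURATION chinese
    DROP MAPPING IF EXISTS FOR e;

-- Existing index entries were built with the old mapping, so rebuild the index.
REINDEX INDEX idx_gin_content;
```

If a term you expect to search for shows up in the first result set, add its alias to the mapping with ADD MAPPING instead of changing the query.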
Conclusion
pg_jieba is a powerful tool for implementing Chinese full-text search in PostgreSQL. With ServBay's pre-installed pg_jieba extension, developers can easily enable and configure Chinese word segmentation locally. This step-by-step guide has covered enabling pg_jieba in ServBay, setting up a text search configuration, running full-text queries, and optimizing segmentation with custom dictionaries. Applying these techniques to your projects can significantly enhance the searchability of Chinese content.