Implementing Chinese Full-Text Search with pg_jieba in ServBay
Overview
For languages such as English, PostgreSQL’s built-in full-text search works effectively due to lexical analysis based on spaces and punctuation. However, since Chinese text does not use spaces to separate words, a dedicated segmentation tool is required to break continuous sequences of Chinese characters into words with independent meanings.
pg_jieba is a third-party extension for PostgreSQL that integrates the popular Jieba Chinese segmentation library. With pg_jieba, you can efficiently and accurately segment Chinese text within PostgreSQL, enabling robust Chinese full-text search capabilities.
ServBay, as an integrated local web development environment, comes with the pg_jieba extension pre-installed, eliminating the need for manual compilation and setup. This allows you to develop and test Chinese full-text search functionalities quickly in your local environment.
This guide walks you through enabling, configuring, and using the pg_jieba extension in ServBay.
Prerequisites
Before using pg_jieba, ensure you have completed the following:
- ServBay is installed on your macOS system, and the PostgreSQL database is running correctly.
- You are familiar with basic PostgreSQL operations, including connecting to databases and executing SQL statements.
Installing and Enabling pg_jieba
ServBay packages the pg_jieba extension along with PostgreSQL, so you don’t need to download or compile anything manually. You just need to enable it in your target database with a simple SQL command.
Here’s how to enable the pg_jieba extension:
Connect to your PostgreSQL database: Open the Terminal and use the psql command-line tool to connect to your PostgreSQL database. Replace your_username with your PostgreSQL username and your_database with your database name. By default, ServBay uses servbay or postgres as the username and database.

```bash
psql -U your_username -d your_database
```

For example, using the default user and database:

```bash
psql -U servbay -d servbay
```

Create and enable the pg_jieba extension: In the psql command line, execute:

```sql
CREATE EXTENSION pg_jieba;
```

If the extension is already enabled in this database, running the command again reports an error saying the extension already exists. This is expected.

Verify that pg_jieba is enabled: List the installed extensions with:

```sql
\dx
```

If pg_jieba appears in the list, the extension is enabled successfully.
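For setup scripts that may run more than once, the enable-and-verify steps can be sketched in an idempotent form. This uses only standard PostgreSQL features (the IF NOT EXISTS clause and the pg_available_extensions view); the extension name pg_jieba is taken from the steps above.

```sql
-- Enable the extension only if it is not already present in this database.
CREATE EXTENSION IF NOT EXISTS pg_jieba;

-- installed_version is NULL when the extension files are available to the
-- server but the extension is not yet enabled in the current database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_jieba';
```

If the SELECT returns no rows, the server cannot see the extension files at all, which is the situation described in the FAQ entry about the "is not available" error.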
Configuring pg_jieba for Chinese Full-Text Search
Once pg_jieba is enabled, you need to configure PostgreSQL's text search features to use pg_jieba for segmentation.
Setting up Text Search Configuration
A text search configuration defines how documents are processed for full-text search, including which parser to use and how to handle various token types.
Create a new text search configuration: Create a configuration named chinese that uses pg_jieba as its parser.

```sql
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = pg_jieba);
```

This configuration tells PostgreSQL to use pg_jieba for tokenizing text.

Add mappings for segmentation results: The pg_jieba parser generates various token types based on parts of speech. To enable indexing and searching for these tokens, map them to a specific dictionary. Here, we map common tags (such as noun n, verb v, adjective a, etc.) to PostgreSQL’s built-in simple dictionary, which essentially uses the tokens from the parser as-is without further processing.

```sql
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l WITH simple;
```

The tags n,v,a,i,e,l cover some of the token types that pg_jieba may recognize. You can add or modify these as needed. Common tags include:

- n: Noun
- v: Verb
- a: Adjective
- i: Idiom
- e: Interjection
- l: Colloquialism
- nr: Person name
- ns: Place name
- nt: Organization
- nz: Other proper noun
- m: Numeral
- q: Quantifier
- t: Time word
- s: Place word
- f: Direction word
- p: Preposition
- c: Conjunction
- u: Auxiliary word
- xc: Other function word
- w: Punctuation
- eng: English
- x: Non-morpheme character
Typically, you’ll want to index and search for meaningful words like nouns, verbs, and adjectives.
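To see which parser names and token types are actually available on your installation, the system catalogs can be queried directly. The following is a minimal sketch using standard PostgreSQL catalogs and the built-in ts_token_type function; it assumes the chinese configuration from the previous step already exists. The parser name can differ between pg_jieba builds, so if CREATE TEXT SEARCH CONFIGURATION reports that the parser does not exist, the first query shows the names to choose from.

```sql
-- Parsers registered in this database.
SELECT prsname FROM pg_ts_parser ORDER BY prsname;

-- Token types emitted by the parser behind the chinese configuration.
SELECT t.alias, t.description
FROM pg_ts_config AS c
CROSS JOIN LATERAL ts_token_type(c.cfgparser) AS t
WHERE c.cfgname = 'chinese'
ORDER BY t.alias;

-- Token types that currently have a dictionary mapping in chinese.
SELECT t.alias AS token_type, d.dictname AS dictionary
FROM pg_ts_config AS c
JOIN pg_ts_config_map AS m ON m.mapcfg = c.oid
JOIN pg_ts_dict AS d ON d.oid = m.mapdict
CROSS JOIN LATERAL ts_token_type(c.cfgparser) AS t
WHERE c.cfgname = 'chinese'
  AND t.tokid = m.maptokentype
ORDER BY t.alias;
```

Token types that appear in the second result but not in the third are discarded during indexing and searching.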
Example: Full-Text Search Using pg_jieba
After configuration, you can perform Chinese full-text search using pg_jieba. Here’s a simple example:
Creating a Sample Table and Data
First, create a table to store documents and insert some sample Chinese texts.
Create the table:
```sql
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT
);
```

Insert sample data:

```sql
INSERT INTO documents (content) VALUES
    ('我爱自然语言处理技术'),
    ('中文分词是文本处理的重要步骤'),
    ('pg_jieba是一个很好的中文分词工具,它基于结巴分词库'),
    ('ServBay 让本地开发变得简单高效');
```
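Before building an index, it can help to look at what the configuration produces for the sample rows. This short sketch assumes the documents table and the chinese configuration defined above; the exact lexemes depend on the Jieba dictionary and on which token types are mapped, so no expected output is shown.

```sql
-- Show the lexemes that would be indexed for each sample document.
SELECT id, to_tsvector('chinese', content) AS lexemes
FROM documents
ORDER BY id;
```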
Creating a Full-Text Search Index
To optimize search efficiency—especially for large datasets—it’s recommended to create an index on the search column. The PostgreSQL GIN (Generalized Inverted Index) is ideal for full-text search.
Create a GIN index: Use the to_tsvector function with your previously created chinese configuration to create a GIN index on the content column. to_tsvector('chinese', content) converts the text in content into a tsvector, PostgreSQL’s internal representation for full-text search, using the chinese configuration (i.e. the pg_jieba parser).

```sql
CREATE INDEX idx_gin_content ON documents USING gin (to_tsvector('chinese', content));
```
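An alternative to an expression index is to store the tsvector in its own column and index that column. The sketch below assumes PostgreSQL 12 or later (for generated columns) and uses hypothetical names content_tsv and idx_gin_content_tsv that are not part of the original example.

```sql
-- Store the segmented text alongside the row; PostgreSQL keeps it in sync
-- whenever content is inserted or updated.
ALTER TABLE documents
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('chinese', content)) STORED;

-- Index the stored column instead of the expression.
CREATE INDEX idx_gin_content_tsv ON documents USING gin (content_tsv);

-- Queries then reference the column directly.
SELECT id, content
FROM documents
WHERE content_tsv @@ to_tsquery('chinese', '中文 & 分词');
```

With either approach, vectors that were computed before a change to the dictionary or to the token mappings are not refreshed automatically. Existing rows keep their old segmentation until the index is rebuilt or the rows are rewritten.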
Performing Full-Text Search Queries
Now you can use the to_tsquery function with the @@ operator to run full-text searches. to_tsquery('chinese', 'your query') converts your search terms into a tsquery type using the chinese configuration. The @@ operator checks if a tsvector matches a tsquery.
Run a search query: Find documents containing both “中文” and “分词”:
```sql
SELECT id, content
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '中文 & 分词');
```

The & symbol stands for logical AND in tsquery. You may use | for OR and ! for NOT.

For instance, to find documents containing either “ServBay” or “开发”:

```sql
SELECT id, content
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', 'ServBay | 开发');
```
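The NOT operator and result ranking follow the same pattern. The sketch below reuses the documents table and the chinese configuration from this guide and relies on the standard ts_rank function. Which rows match depends on how Jieba segments the sample text, so no expected output is shown.

```sql
-- Documents that mention 分词 but not 工具.
SELECT id, content
FROM documents
WHERE to_tsvector('chinese', content) @@ to_tsquery('chinese', '分词 & !工具');

-- Rank matching documents so the most relevant ones come first.
SELECT id, content,
       ts_rank(to_tsvector('chinese', content), query) AS rank
FROM documents,
     to_tsquery('chinese', '中文 | 分词') AS query
WHERE to_tsvector('chinese', content) @@ query
ORDER BY rank DESC;
```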
Custom Dictionaries
pg_jieba uses Jieba’s default dictionary for segmentation. In certain scenarios, you may need to add specialized terms (such as technical jargon, product names, etc.) to improve segmentation accuracy.
You can create a custom dictionary file and configure pg_jieba to use it.
Adding Custom Terms
Create a custom dictionary file: In ServBay’s configuration directory, create a text file such as:
```plaintext
/Applications/ServBay/etc/pg_jieba/custom_dict.txt
```

Note: This is a suggested path; you can choose the most suitable location based on your ServBay installation and preferences.

Add terms to the custom dictionary: Open custom_dict.txt with a text editor. Add one custom word per line. Optionally, specify the word frequency and part-of-speech tag, separated by spaces. The format is word [frequency [tag]]. The higher the frequency, the more likely the word is to be recognized as a single term.

```plaintext
自然语言处理 3 n
ServBay 5 eng
结巴分词库 3 n
```

Here, 3 n means “自然语言处理” has a frequency of 3 and is a noun. 5 eng means “ServBay” has a frequency of 5 and is tagged as English (eng).

Configure pg_jieba to use the custom dictionary: In your PostgreSQL session, set the pg_jieba.dict_path parameter to point to the directory containing your custom dictionary. Note: pg_jieba.dict_path typically refers to the dictionary directory, not a single file. If your custom dictionary shares a directory with the main dictionary, or is already in the directory configured for pg_jieba dictionaries, you might not need to modify this parameter. ServBay’s packaging of pg_jieba might also have its own requirements. Refer to ServBay’s documentation or experiment to confirm the correct dict_path setting.

If ServBay's pg_jieba configuration allows specifying the custom dictionary file directly, or if your custom file is placed in a directory that pg_jieba scans by default, the following SET command may vary or be unnecessary. The following is based on standard documentation; please confirm it against your actual ServBay setup:

```sql
-- Assuming ServBay places the main dictionary here and custom_dict.txt is in this folder
SET pg_jieba.dict_path = '/Applications/ServBay/etc/pg_jieba/';
```

Or, if pg_jieba.dict_path can point directly to a file (this is non-standard but presented for completeness):

```sql
-- Use with caution; verify with your ServBay configuration
SET pg_jieba.dict_path = '/Applications/ServBay/etc/pg_jieba/custom_dict.txt';
```

Important: The SET command is effective only for the current database session. To make the setting permanent, set the pg_jieba.dict_path parameter in PostgreSQL’s postgresql.conf.
Reloading the Dictionary
After modifying the custom dictionary or changing the pg_jieba.dict_path configuration, instruct pg_jieba to reload the dictionary for changes to take effect.
Reload the dictionary: Execute this SQL function:

```sql
SELECT jieba_reload_dict();
```

Once executed successfully, all subsequent segmentation will use the updated dictionary.
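A quick way to confirm that the session sees the intended setting and that a custom term is now kept whole is sketched below. It assumes the parameter name pg_jieba.dict_path and the sample dictionary entries used in this guide; current_setting with a second argument of true is standard PostgreSQL and returns NULL rather than raising an error when the parameter is not set.

```sql
-- Show the dictionary path as seen by the current session (NULL if unset).
SELECT current_setting('pg_jieba.dict_path', true) AS dict_path;

-- Segment a phrase containing a custom term. If the custom dictionary is
-- active, 结巴分词库 should appear as a single lexeme in the result.
SELECT to_tsvector('chinese', '结巴分词库是一个中文分词工具') AS lexemes;
```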
Frequently Asked Questions (FAQ)
Q: What should I do if I get the error "extension 'pg_jieba' is not available" when executing CREATE EXTENSION pg_jieba;?
A: This usually means the pg_jieba extension files are not correctly installed in PostgreSQL's extension directory or cannot be found. In ServBay, pg_jieba should be pre-installed. Ensure you are connecting to the PostgreSQL instance provided by ServBay and that your ServBay installation is not corrupted. If the problem persists, try restarting ServBay or checking the ServBay log files.
Q: What if my custom dictionary doesn’t take effect?
A: Please check the following:
- Ensure the custom dictionary file path is correct and the PostgreSQL user has read permissions for the file.
- Verify the file format: one term per line, optional frequency and part-of-speech tag separated by spaces.
- Make sure you have set the pg_jieba.dict_path parameter correctly. Note that the SET command applies only to the current session; for other sessions or to persist after restarts, you must update postgresql.conf.
- Confirm you have executed SELECT jieba_reload_dict(); to reload the dictionary.
- If you updated postgresql.conf, be sure to restart the PostgreSQL service.
Q: What should I do if the full-text search results are inaccurate? A: Full-text search accuracy depends on the segmentation results and query formulation.
- Inspect the segmentation: Use the ts_debug('chinese', 'your text') function to see how your text is tokenized using the chinese configuration. This can help you assess whether pg_jieba and your custom dictionary are working as intended.
- Optimize segmentation configuration: Tune the ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR ... WITH simple; statement to include or exclude parts of speech (like filtering out function words, punctuation, etc.).
- Refine your search query: Check whether the keywords and logical operators (&, |, !) in your to_tsquery accurately reflect your search intent.
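The first two checks can be sketched as follows, using the standard ts_debug function and one of the sample sentences from this guide. The additional tags in the ALTER statement come from the tag list earlier in this guide; whether your pg_jieba build emits exactly these token types is an assumption to verify on your installation.

```sql
-- One row per token: its type (alias), the dictionaries consulted, and the
-- resulting lexemes. A token whose type has no mapping shows an empty
-- dictionaries list and NULL lexemes, and is dropped from the index.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('chinese', 'pg_jieba是一个很好的中文分词工具');

-- If useful tokens are being dropped (for example names or English words),
-- map their token types as well.
ALTER TEXT SEARCH CONFIGURATION chinese
    ADD MAPPING FOR nr, ns, nt, nz, eng WITH simple;
```

A term that is dropped at indexing time is also dropped from the query, so a search for an unmapped token type (such as an English product name when eng is not mapped) cannot match anything.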
Conclusion
pg_jieba is a powerful tool for implementing Chinese full-text search in PostgreSQL. With ServBay’s pre-installed pg_jieba extension, developers can easily enable and configure Chinese word segmentation locally. This step-by-step guide has covered installing pg_jieba in ServBay, setting up and configuring text search, running full-text queries, and optimizing segmentation with custom dictionaries. Applying these techniques to your projects can significantly enhance the searchability of Chinese content.
