SCWS Chinese Word Segmentation in ServBay: Installation, Configuration & Usage Guide
As a robust local web development environment, ServBay comes pre-integrated with many essential tools and packages for developers. SCWS (Simple Chinese Word Segmentation) is a high-efficiency Chinese segmentation system, crucial for processing Chinese texts in scenarios such as search, NLP, content analytics, and more. ServBay already has SCWS and its PHP module preinstalled, so there's no need for any complicated extra installation steps. This guide provides detailed instructions on how to configure and use SCWS within ServBay, including both command-line tools and PHP API usage.
Overview
SCWS is a high-performance Chinese word segmentation library, especially suitable for scenarios where you need to segment large amounts of Chinese text quickly and accurately. It supports multiple segmentation modes, custom dictionaries, and rules, making it a fundamental tool for building Chinese search, recommendation, and text analysis applications. ServBay has integrated SCWS into its distribution, providing a precompiled PHP extension, which greatly simplifies using SCWS in your local development setup.
Prerequisites
- You have successfully installed and are running ServBay on macOS.
Installation & Configuration
Installation
ServBay is designed to deliver a ready-to-use development environment. As a vital Chinese language processing tool, SCWS is already preinstalled with ServBay. There's no need for additional downloads or compilation. The executable files, configuration files, and dictionaries for SCWS are all centrally located in your ServBay installation directory, which is /Applications/ServBay/ by default.
Configuration
The default SCWS configuration file can be found at /Applications/ServBay/etc/scws/scws.ini within your ServBay installation. You can modify this file to adjust SCWS's segmentation behavior, character set, dictionaries, and rule settings.
Here’s an example of the default configuration file:
[charset]
default = utf8
[rule]
rules = /Applications/ServBay/etc/scws/rules.ini
[dict]
dict = /Applications/ServBay/etc/scws/dict.utf8.xdb
- [charset]: Specifies the default character set, usually left as utf8.
- [rule]: Specifies the path to the segmentation rules file.
- [dict]: Specifies the path to the word dictionary file. You may specify multiple dictionary files separated by commas (,).
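To make the three sections concrete, here is a minimal PHP sketch that reads a configuration of this shape and splits the comma-separated dictionary list. It only uses PHP's built-in parse_ini_string(); the sample text mirrors the default file, and the second dictionary path is a placeholder added for illustration. It does not claim to reproduce how SCWS or ServBay reads the file internally.

```php
<?php
// Sample text mirroring the default scws.ini, with a placeholder second
// dictionary appended to illustrate the comma-separated list.
$ini = <<<'INI'
[charset]
default = utf8
[rule]
rules = /Applications/ServBay/etc/scws/rules.ini
[dict]
dict = /Applications/ServBay/etc/scws/dict.utf8.xdb,/path/to/your/custom_dict.xdb
INI;

// INI_SCANNER_RAW keeps every value as a plain string (no type conversion).
$config = parse_ini_string($ini, true, INI_SCANNER_RAW);

$charset   = $config['charset']['default'];
$rulesFile = $config['rule']['rules'];
// Split the dictionary list into individual paths.
$dictFiles = array_map('trim', explode(',', $config['dict']['dict']));

echo "charset: $charset\n";
echo "rules:   $rulesFile\n";
foreach ($dictFiles as $path) {
    echo "dict:    $path\n";
}
```

To inspect the real file instead of the sample text, parse_ini_file() accepts the same two trailing arguments.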
Basic Usage: Command-Line Tool
SCWS provides a command-line utility, scws, which allows you to test or batch-process Chinese text segmentation right in your terminal. ServBay includes the scws executable in its bin directory. Typically, /Applications/ServBay/bin is already added to your system's PATH, so you can run scws commands directly in the terminal.
Segmentation Examples
Below are some basic examples of using the scws command-line tool:
Segment a String
Pipe a string directly to the scws command:
echo "这是一个中文分词的例子" | scws -i
Segment Text from a File
Use the -i option to specify the input file, and -o to specify the output file:
scws -i input.txt -o output.txt
Specify Segmentation Rules
Use the -r option to provide a custom rules file path:
scws -i input.txt -o output.txt -r /path/to/your/rules.ini
Specify Custom Dictionary
Use the -d option to specify a custom dictionary file path:
scws -i input.txt -o output.txt -d /path/to/your/dict.utf8.xdb
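If you want to drive these commands from a script, the sketch below assembles a command line from the four options used in the examples above (-i, -o, -r, -d) and quotes each path with escapeshellarg(). The function name buildScwsCommand is hypothetical, and the sketch only builds the string; it assumes scws is reachable through PATH when you eventually run it.

```php
<?php
/**
 * Build an scws command line from the options shown in the examples.
 * Illustrative helper, not part of SCWS or ServBay.
 */
function buildScwsCommand(
    string $input,
    string $output,
    ?string $rules = null,
    ?string $dict = null
): string {
    $parts = ['scws', '-i', escapeshellarg($input), '-o', escapeshellarg($output)];
    if ($rules !== null) {
        $parts[] = '-r';
        $parts[] = escapeshellarg($rules);
    }
    if ($dict !== null) {
        $parts[] = '-d';
        $parts[] = escapeshellarg($dict);
    }
    return implode(' ', $parts);
}

// Print the command instead of running it, so the sketch has no side effects.
echo buildScwsCommand('input.txt', 'output.txt', null, '/path/to/your/dict.utf8.xdb'), "\n";
```

The resulting string can be passed to exec() or shell_exec() once you have checked that it matches what you would type by hand.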
Advanced Usage
Custom Dictionaries
For improved segmentation accuracy, especially for industry-specific terminology, person names, place names, or new words, you can create custom dictionaries. SCWS uses an efficient xdb dictionary format. You can convert a text-format dictionary to an xdb file using the scws-gen tool provided by ServBay.
Steps to Create a Custom Dictionary:
1. Create a text file, e.g., custom_dict.txt. Each line contains a word, optionally followed by a space and its weight (an integer influencing segmentation priority).
   ServBay 10
   Local development environment 8
   Chinese word segmentation 9
2. Use the scws-gen tool to generate an xdb dictionary file. scws-gen is also located in ServBay's bin directory.
   scws-gen -i custom_dict.txt -o custom_dict.xdb
3. Edit the [dict] section of your SCWS configuration file /Applications/ServBay/etc/scws/scws.ini and add the path to your custom dictionary after the default, separated by a comma.
   [dict]
   dict = /Applications/ServBay/etc/scws/dict.utf8.xdb,/path/to/your/custom_dict.xdb
   Make sure /path/to/your/custom_dict.xdb matches where you actually store your custom dictionary.
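When the word list lives in your application, the text file from step 1 can be generated rather than typed. The sketch below formats a word-to-weight map in the layout described above (word, a space, then an integer weight). The helper name formatDictLines and the sample terms are illustrative assumptions, and the output still has to be converted with scws-gen as in step 2.

```php
<?php
/**
 * Format a word => weight map as dictionary source text,
 * one "word weight" pair per line.
 */
function formatDictLines(array $weights): string
{
    $lines = [];
    foreach ($weights as $word => $weight) {
        $lines[] = $word . ' ' . (int) $weight;
    }
    return implode("\n", $lines) . "\n";
}

// Sample terms; replace them with your own vocabulary.
$terms = [
    'ServBay'  => 10,
    '中文分词' => 9,
];

// Printed here; file_put_contents('custom_dict.txt', ...) would save it instead.
echo formatDictLines($terms);
```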
Tuning Segmentation Rules
The rules file (default: /Applications/ServBay/etc/scws/rules.ini) defines how SCWS handles ambiguities or complex Chinese structures. Editing the rules file often requires an in-depth understanding of SCWS's segmentation algorithm. For most users, the default rules combined with custom dictionaries are sufficient. If you need to tweak the rules, do so carefully and refer to the official SCWS documentation for rule file formats and syntax (if documentation is included with the SCWS version shipped by ServBay).
Sample rules file content (generally contains pattern-matching rules):
[rule]
# Add custom segmentation rules here
# Example: Define a simple rule
# pattern = result
Using the PHP API
For developers building web applications with PHP, the PHP environment in ServBay already comes with the SCWS extension module enabled. This means you don’t need to install or configure any extra PHP extension; you can invoke the SCWS API directly in your PHP code for Chinese text segmentation.
You can verify whether the SCWS extension is enabled by visiting ServBay's built-in phpinfo() page.
Usage Example
Here's a sample PHP script demonstrating how to use the SCWS API to perform segmentation:
<?php
// Ensure the SCWS extension is loaded
if (!extension_loaded('scws')) {
    die("SCWS extension is not loaded.");
}

// The text to be segmented
$text = "ServBay 是一款强大的本地 Web 开发环境,支持 PHP、Node.js、Python 等多种语言,并集成了 MySQL、Nginx 等软件包。";

// Open an SCWS segmenter instance
$sh = scws_open();

// Set the character set, usually matching your text encoding
scws_set_charset($sh, 'utf8');

// Specify dictionary and rule file paths
// Make sure these are the actual SCWS file paths in the ServBay environment
$dict_path = '/Applications/ServBay/etc/scws/dict.utf8.xdb';
$rule_path = '/Applications/ServBay/etc/scws/rules.ini';

if (!file_exists($dict_path)) {
    die("SCWS dictionary file not found: " . $dict_path);
}
if (!file_exists($rule_path)) {
    die("SCWS rules file not found: " . $rule_path);
}

scws_set_dict($sh, $dict_path);
scws_set_rule($sh, $rule_path);

// Send the text to be segmented to the SCWS instance
scws_send_text($sh, $text);

// Retrieve segmentation results
echo "Original Text: " . $text . "\n";
echo "Segmentation Result:\n";

// Loop through and print the segmentation results
// $res is an array; each element represents a segmented word with details (word, attribute, etc.)
while ($res = scws_get_result($sh)) {
    foreach ($res as $word_info) {
        // Print the word itself
        echo $word_info['word'] . " ";
        // Optionally, print part of speech or weight if needed, e.g.:
        // echo "Word: " . $word_info['word'] . ", POS: " . $word_info['attr'] . ", Weight: " . $word_info['idf'] . "\n";
    }
}
echo "\n";

// Close the SCWS instance and release resources
scws_close($sh);
?>
You can save this code as a .php file (e.g., segment_test.php), place it in the website root directory of ServBay (/Applications/ServBay/www/servbay.demo/ if you have a site named servbay.demo), and then access it via your browser or run it from the terminal using PHP CLI to view the segmentation results.
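In application code it is often handier to collect the words into an array than to echo them. The sketch below wraps the chunked retrieval loop in a small helper. collectWords is a hypothetical name, and the stand-in data only imitates the chunk structure described above (arrays of entries with a 'word' key), so the sketch runs even where the extension is unavailable.

```php
<?php
/**
 * Drain a chunked result source into a flat list of words.
 * $nextChunk must return an array of entries (each with a 'word' key),
 * or false once nothing is left, as scws_get_result() is described to do.
 */
function collectWords(callable $nextChunk): array
{
    $words = [];
    while ($chunk = $nextChunk()) {
        foreach ($chunk as $entry) {
            $words[] = $entry['word'];
        }
    }
    return $words;
}

// Stand-in chunks (made up) so the sketch runs without the SCWS extension.
$fakeChunks = [
    [['word' => '中文'], ['word' => '分词']],
    [['word' => '例子']],
];
$words = collectWords(function () use (&$fakeChunks) {
    // array_shift() returns null on an empty array; map that to false.
    return array_shift($fakeChunks) ?? false;
});

echo implode(' ', $words), "\n";
```

With a real segmenter handle, the callable would simply wrap the extension call, for example function () use ($sh) { return scws_get_result($sh); }.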
Common PHP Extension Functions
Here are some commonly used core functions in the SCWS PHP extension:
- scws_open(): Initializes and opens an SCWS segmenter instance. Returns a resource handle on success, or false on failure.
- scws_set_charset($sh, $charset): Sets the character set for the segmenter instance $sh.
- scws_set_dict($sh, $dict_path, $mode = SCWS_XDICT_XDB): Sets the dictionary path for the segmenter instance $sh. $mode specifies the dictionary format; the default, SCWS_XDICT_XDB, reads an xdb file, while SCWS_XDICT_TXT means text format (deprecated, xdb is recommended). Usually, just provide the $dict_path to the xdb file.
- scws_set_rule($sh, $rule_path): Sets the rule file path for the segmenter instance $sh.
- scws_send_text($sh, $text): Submits the $text to be segmented to the segmenter instance $sh for processing.
- scws_get_result($sh): Retrieves segmentation results from the segmenter instance $sh. Returns a detailed array for each chunk until processing is complete, then returns false.
- scws_close($sh): Closes the segmenter instance $sh and releases resources.
For more advanced functions (such as ignoring punctuation, segmentation modes, retrieving word weights, and more), refer to the SCWS PHP extension official documentation.
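As one example of working with word weights in plain PHP, the sketch below ranks segmented words by the idf value carried in each result entry (the 'word', 'attr', and 'idf' keys used in the usage example). The helper name rankByIdf and the sample values are made up for illustration; this is post-processing of the result arrays, not a function provided by the extension.

```php
<?php
/**
 * Return up to $limit distinct words, highest idf first.
 * Each entry is expected to carry 'word' and 'idf' keys.
 */
function rankByIdf(array $entries, int $limit): array
{
    $best = [];
    foreach ($entries as $entry) {
        $word = $entry['word'];
        $idf  = (float) $entry['idf'];
        // Keep the highest idf seen for each distinct word.
        if (!isset($best[$word]) || $idf > $best[$word]) {
            $best[$word] = $idf;
        }
    }
    arsort($best); // sort by idf, descending, preserving the word keys
    // Numeric-looking words become integer keys, so cast back to strings.
    return array_map('strval', array_slice(array_keys($best), 0, $limit));
}

// Made-up entries shaped like the items yielded by scws_get_result().
$entries = [
    ['word' => '开发', 'attr' => 'vn', 'idf' => 4.5],
    ['word' => '的',   'attr' => 'uj', 'idf' => 0.0],
    ['word' => '分词', 'attr' => 'n',  'idf' => 7.2],
    ['word' => '开发', 'attr' => 'vn', 'idf' => 4.5],
];

echo implode(', ', rankByIdf($entries, 2)), "\n";
```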
Frequently Asked Questions (FAQ)
1. What should I do if SCWS segmentation results are inaccurate?
- Solution: First, check that the dict and rule file paths in the configuration file /Applications/ServBay/etc/scws/scws.ini are correct and that these files exist and are readable. For domain-specific texts or new words, it's recommended to create a custom dictionary (use scws-gen to generate the xdb format) and add your custom dictionary path to the configuration. Adjusting word weights or segmentation rules may help further, but requires more in-depth knowledge.
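The existence and readability check can be scripted. The sketch below is a minimal version that reports which of the given paths PHP cannot read; findUnreadable is a hypothetical helper name, the paths are the defaults quoted in this guide, and arrow functions require PHP 7.4 or later.

```php
<?php
/**
 * Return the subset of $paths that the current PHP process cannot read.
 */
function findUnreadable(array $paths): array
{
    return array_values(array_filter(
        $paths,
        fn(string $path): bool => !is_readable($path)
    ));
}

// Default locations quoted in this guide; adjust them to your setup.
$paths = [
    '/Applications/ServBay/etc/scws/dict.utf8.xdb',
    '/Applications/ServBay/etc/scws/rules.ini',
];

$problems = findUnreadable($paths);
if ($problems === []) {
    echo "All SCWS files are readable.\n";
} else {
    foreach ($problems as $path) {
        echo "Missing or unreadable: $path\n";
    }
}
```

Keep in mind that readability depends on the user the script runs as, so a check from the terminal and a check through the web server may disagree.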
2. What if SCWS is slow or segmentation performance is poor?
- Solution: Ensure that SCWS is using the optimized xdb dictionary format rather than the older text format. The xdb format provides faster loading and lookups. In your config, make sure the dictionary path points to an xdb file. For large texts, consider splitting the processing into smaller chunks.
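One way to split a large text is on sentence boundaries, so that no word is cut in half. The sketch below is an assumption-laden example: splitIntoChunks is a hypothetical helper, the boundary set (。！？ and newline) is a simple choice rather than anything SCWS prescribes, and it relies on PHP's mbstring extension for mb_strlen().

```php
<?php
/**
 * Split UTF-8 text into chunks of at most $maxChars characters,
 * breaking only after sentence-ending punctuation or a newline.
 * A single sentence longer than $maxChars is kept whole.
 */
function splitIntoChunks(string $text, int $maxChars): array
{
    // Zero-width split right after each boundary character.
    $sentences = preg_split('/(?<=[。！？\n])/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $chunks  = [];
    $current = '';
    foreach ($sentences as $sentence) {
        $tooLong = mb_strlen($current . $sentence, 'UTF-8') > $maxChars;
        if ($current !== '' && $tooLong) {
            $chunks[] = $current;
            $current  = '';
        }
        $current .= $sentence;
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

foreach (splitIntoChunks("甲。乙。丙。", 4) as $i => $chunk) {
    echo $i, ': ', $chunk, "\n";
}
```

Each chunk can then be passed to scws_send_text() in turn, reusing the same segmenter handle.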
3. What if the SCWS command-line tool can't be found or won't run?
- Solution: This usually means ServBay's executable directory isn't in your system PATH variable. Try running the command using the full path, e.g., /Applications/ServBay/bin/scws -i ... . Alternatively, add /Applications/ServBay/bin to PATH in your shell profile (such as ~/.bash_profile or ~/.zshrc), then reload the profile or restart your terminal.
4. Why does scws_open() fail, or why is the function missing in PHP?
- Solution: This means the SCWS PHP extension isn't loaded in your ServBay PHP environment. Check the active PHP version in ServBay, then view its phpinfo() page (ServBay often provides a shortcut) to see if scws is listed and enabled. If not enabled, check your PHP configuration file (php.ini) for a line like extension=scws.so, and make sure scws.so exists in the PHP extensions directory (which ServBay pre-configures). If issues persist, try restarting the ServBay service.
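The same checks can be gathered in one place with a short diagnostic script. The sketch below relies only on standard PHP functions (extension_loaded(), php_ini_loaded_file(), ini_get()); the function name scwsExtensionReport is hypothetical, and the scws.so file name is taken from the answer above.

```php
<?php
/**
 * Collect the facts needed to debug a missing SCWS extension.
 */
function scwsExtensionReport(): array
{
    $extDir = (string) ini_get('extension_dir');
    return [
        'loaded'        => extension_loaded('scws'),
        'php_version'   => PHP_VERSION,
        'ini_file'      => php_ini_loaded_file(), // false when no php.ini was loaded
        'extension_dir' => $extDir,
        'so_present'    => is_file($extDir . DIRECTORY_SEPARATOR . 'scws.so'),
    ];
}

foreach (scwsExtensionReport() as $key => $value) {
    echo str_pad($key, 14), var_export($value, true), "\n";
}
```

Run it both from the terminal and through the web server: the command-line and web PHP runtimes can load different php.ini files, so the two reports may differ.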
Summary
SCWS is a powerful and efficient Chinese word segmentation system. Because ServBay ships SCWS and its PHP extension preinstalled, developers can configure and use it in their local macOS environment without a separate installation, either for text processing on the command line or for dynamic segmentation in PHP applications. With the configuration, custom dictionary, and API steps in this guide, you can integrate SCWS into your projects and improve their Chinese text processing.