SCWS User Documentation
SCWS (Simple Chinese Word Segmentation) is an efficient Chinese word segmentation system designed for various Chinese text processing tasks. ServBay comes pre-installed with SCWS and its PHP module. This document provides a detailed guide on installing, configuring, and using SCWS.
Installation and Configuration
Installation
ServBay has SCWS and its PHP module pre-installed, so no additional installation is required.
Configuration
SCWS configuration lives in the /Applications/ServBay/etc/scws directory, and the default configuration file is named scws.ini. You can modify this file as needed to adjust segmentation behavior.
Sample configuration file content:
[charset]
default = utf8
[rule]
rules = /Applications/ServBay/etc/scws/rules.ini
[dict]
dict = /Applications/ServBay/etc/scws/dict.utf8.xdb
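A wrong path in this file is an easy mistake to make, so it can help to confirm that every file it references exists. The following is a minimal sketch, assuming the ini layout shown above (rules = and dict = entries, with multiple dictionaries separated by commas). The function names scws_config_paths and check_scws_config are hypothetical helpers, not part of SCWS.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SCWS): check the paths named in scws.ini.

# Print every path found in "rules = ..." and "dict = ..." lines, one per line.
# Comma-separated dictionary lists are split into separate lines.
scws_config_paths() {
    sed -n \
        -e 's/^[[:space:]]*rules[[:space:]]*=[[:space:]]*//p' \
        -e 's/^[[:space:]]*dict[[:space:]]*=[[:space:]]*//p' \
        "$1" | tr ',' '\n'
}

# Report each referenced path as "ok" or "missing".
# Returns non-zero if at least one file is missing.
check_scws_config() {
    scws_config_paths "$1" | (
        status=0
        while IFS= read -r path; do
            [ -n "$path" ] || continue
            if [ -f "$path" ]; then
                echo "ok      $path"
            else
                echo "missing $path"
                status=1
            fi
        done
        exit "$status"
    )
}

# Example (path taken from this document):
# check_scws_config /Applications/ServBay/etc/scws/scws.ini
```

Running the check after each edit to the configuration file catches typos in paths before they surface as segmentation problems.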
Basic Usage
SCWS provides the command line tool scws for text segmentation. Below are some basic usage examples:
Segmentation Examples
Segmentation of a String
echo "这是一个中文分词的例子" | scws -i
Reading Text from a File for Segmentation
scws -i input.txt -o output.txt
Specifying Segmentation Rules
scws -i input.txt -o output.txt -r /path/to/rules.ini
Specifying Dictionary
scws -i input.txt -o output.txt -d /path/to/dict.utf8.xdb
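To apply the file-based form above to many files, the command can be wrapped in a loop. This is a sketch under stated assumptions: it uses only the -i and -o options shown in this section, the function name segment_dir and the SCWS_BIN variable (an override for the executable name) are hypothetical, and the .seg.txt output suffix is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: run scws on every .txt file in a directory.
# SCWS_BIN is an assumed override for the executable; it defaults to "scws".
segment_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for input in "$in_dir"/*.txt; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$input" ] || continue
        output="$out_dir/$(basename "$input" .txt).seg.txt"
        "${SCWS_BIN:-scws}" -i "$input" -o "$output"
    done
}

# Example: segment_dir ./articles ./segmented
```

The -r and -d options from the examples above could be appended to the scws call in the same way if non-default rules or dictionaries are needed.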
Advanced Usage
Custom Dictionary
You can create a custom dictionary to improve segmentation accuracy. Custom dictionaries must be in xdb format, which can be generated using the scws-gen tool.
Creating a Custom Dictionary
1. Create a dictionary text file custom_dict.txt with the following content:
   custom_word1 1
   custom_word2 2
2. Generate the dictionary using the scws-gen tool:
   scws-gen -i custom_dict.txt -o custom_dict.xdb
3. Specify the custom dictionary in the configuration file:
   [dict]
   dict = /Applications/ServBay/etc/scws/dict.utf8.xdb,/path/to/custom_dict.xdb
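The first two steps above can be scripted when the word list changes often. The sketch below is an illustration under assumptions: it writes each word followed by a fixed second-column value, copying the two-column layout of the sample above (what that value means to SCWS is not stated here, so the default of 1 is a placeholder), and it calls scws-gen with the -i and -o options shown above only if the tool is found on PATH. The function names make_dict_source and build_custom_dict are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for maintaining a custom dictionary.

# Write "word value" lines to $2 from a one-word-per-line list in $1.
# Blank lines are dropped and duplicates keep their first occurrence.
# $3 is the second-column value (placeholder default: 1).
make_dict_source() {
    awk -v value="${3:-1}" 'NF && !seen[$0]++ { print $0, value }' "$1" > "$2"
}

# Convert the text file to xdb with scws-gen, if the tool is installed.
build_custom_dict() {
    if command -v scws-gen >/dev/null 2>&1; then
        scws-gen -i "$1" -o "$2"
    else
        echo "scws-gen not found on PATH; skipping xdb generation" >&2
        return 1
    fi
}

# Example:
# make_dict_source words.txt custom_dict.txt
# build_custom_dict custom_dict.txt custom_dict.xdb
```

Keeping the plain word list as the source of truth means the xdb file can be regenerated at any time instead of being edited by hand.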
Adjusting Segmentation Rules
The segmentation rules file rules.ini defines how segmentation is performed, and you can adjust it as needed. The default rules file is located at /Applications/ServBay/etc/scws/rules.ini.
Sample rules file content:
[rule]
# Custom segmentation rules
Using PHP API
ServBay's PHP comes with the SCWS module pre-installed, allowing you to use SCWS for Chinese word segmentation directly in your PHP code.
Usage Example
- Using SCWS in PHP code:
<?php
// Open SCWS tokenizer
$sh = scws_open();

// Set charset
scws_set_charset($sh, 'utf8');

// Set dictionary and segmentation rules
scws_set_dict($sh, '/Applications/ServBay/etc/scws/dict.utf8.xdb');
scws_set_rule($sh, '/Applications/ServBay/etc/scws/rules.ini');

// Send text for segmentation
scws_send_text($sh, "这是一个中文分词的例子");

// Retrieve segmentation results
while ($res = scws_get_result($sh)) {
    foreach ($res as $word) {
        echo $word['word'], "\n";
    }
}

// Close SCWS tokenizer
scws_close($sh);
?>
Common Functions
- scws_open(): Open an SCWS tokenizer instance
- scws_set_charset($sh, $charset): Set the charset
- scws_set_dict($sh, $dict_path): Set the dictionary path
- scws_set_rule($sh, $rule_path): Set the segmentation rules path
- scws_send_text($sh, $text): Send text for segmentation
- scws_get_result($sh): Get segmentation results
- scws_close($sh): Close the SCWS tokenizer instance
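Before calling these functions, a script can confirm that the extension is loaded in the PHP binary being used. This is a minimal sketch, assuming the module appears under the name scws in the output of php -m. PHP_BIN is a hypothetical override for the PHP executable and php_has_scws is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the PHP binary lists a module named "scws".
# PHP_BIN is an assumed override for the executable; it defaults to "php".
php_has_scws() {
    # `php -m` prints the loaded modules, one name per line.
    "${PHP_BIN:-php}" -m 2>/dev/null | grep -ix 'scws' >/dev/null
}

# Example:
# if php_has_scws; then
#     echo "SCWS extension loaded"
# else
#     echo "SCWS extension not loaded"
# fi
```

When several PHP versions are installed side by side, pointing PHP_BIN at each one in turn shows which of them have the module available.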
FAQ
1. Inaccurate SCWS Segmentation Results
- Solution: Verify that the dictionary and rule files are correctly configured. You can try using custom dictionaries and adjusting segmentation rules to improve accuracy.
2. Poor SCWS Performance
- Solution: Ensure SCWS is using an efficient dictionary format (such as xdb) and that the dictionary is correctly specified in the configuration file.
3. SCWS Command Line Tool Fails to Run
- Solution: Ensure SCWS is correctly installed and the configuration file paths are accurate. If the issue persists, check the error log for more information.
Summary
SCWS is an efficient Chinese word segmentation system suitable for various Chinese text processing tasks. This document provides a guide on how to install, configure, and use SCWS in ServBay.