Monday, March 4, 2013

Using zpar's Chinese word segmentation

1) Download zpar from http://faculty.sutd.edu.sg/~yue_zhang/doc/index.html
2) Read the Chinese word segmentation documentation at http://faculty.sutd.edu.sg/~yue_zhang/doc/doc/segmenter.html; it is written very clearly.
3) Compile with make segmentor (this produces the binaries under dist/segmentor/)
4) Training
     ./dist/segmentor/train <train-file> <model-file> <number of iterations>
     ./dist/segmentor/train ctb6.0/ctb6.0.train.segmented.txt ctb6.0/ctb6.0.seg.model 1
    (The last argument is the number of iterations. You can use test.sh, shown below, to run many iterations and pick the best one; see section 6 "How to tune the performance of a system" in the Chinese word segmentation documentation.)
    (Download evaluate.py from http://faculty.sutd.edu.sg/~yue_zhang/doc/doc/seg_files/evaluate.py)
    (You will need to adjust some file names and paths in test.sh. Each run continues training from the previously saved model, which is why test.sh always executes $segmentor/train train.txt model 1.)
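The *.unseg.txt inputs used in the commands below are presumably just the gold segmented files with the spaces stripped out (one sentence per line). A minimal Python 3 sketch for producing such a file; the script name strip_spaces.py is only an example and not part of zpar:

#####strip_spaces.py#####
# make an unsegmented input file from a gold segmented one
# (a sketch only; assumes one sentence per line with words separated by spaces)
import sys

def unsegment(segmented_path, unsegmented_path):
    with open(segmented_path, encoding="utf-8") as fin, \
         open(unsegmented_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # drop all whitespace between words, keep one sentence per line
            fout.write("".join(line.split()) + "\n")

if __name__ == "__main__":
    unsegment(sys.argv[1], sys.argv[2])

For example: python strip_spaces.py ctb6.0/ctb6.0.deve.segmented.txt ctb6.0/ctb6.0.deve.unseg.txt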

#####test.sh#####
segmentor=dist/segmentor

# start from scratch: remove any previously trained model
rm -f model

for i in `seq 1 30`;
do
    echo "iteration $i"
    # each call adds one more training iteration on top of the current model
    $segmentor/train ctb6.0/ctb6.0.train.segmented.txt model 1
    # segment the development set with the current model and score it
    $segmentor/segmentor model ctb6.0/ctb6.0.deve.unseg.txt ctb6.0.deve.autoseg.txt
    python evaluate.py ctb6.0.deve.autoseg.txt ctb6.0/ctb6.0.deve.segmented.txt
    # keep a copy of the model after this iteration
    cp model ctb6.0/ctb6.0.seg.model.$i
done
Log from iteration 18:

iteration 18

Training started ...
Loading model ... done (8.39196s).
Computing averaged feature scores ... done
Saving model ... done.
Done.
Training has finished successfully. Total time taken is: 49.2729
Segmenting started
Loading model ... done (8.06785s).
Segmenting has finished successfully. Total time taken is: 11.2431
Word precision: 0.955604476243
Word recall: 0.955700108415
Word F-measure: 0.955652289936
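
To pick the best iteration automatically, the output of test.sh can be redirected to a file (e.g. ./test.sh > tune.log 2>&1) and scanned for the highest F-measure. The following Python 3 sketch assumes the log format shown above; the script name pick_best.py is arbitrary:

#####pick_best.py#####
# find the iteration with the highest dev F-measure in the tuning log
# (a sketch; assumes test.sh output was saved, e.g.: ./test.sh > tune.log 2>&1)
import re
import sys

def best_iteration(log_path):
    best_f, best_iter, current = -1.0, -1, -1
    with open(log_path) as f:
        for line in f:
            m = re.match(r"iteration (\d+)", line)
            if m:
                current = int(m.group(1))
            m = re.match(r"Word F-measure:\s*([\d.]+)", line)
            if m and float(m.group(1)) > best_f:
                best_f, best_iter = float(m.group(1)), current
    return best_iter, best_f

if __name__ == "__main__":
    it, f = best_iteration(sys.argv[1])
    print("best iteration: %d (F-measure %.6f)" % (it, f))

Run it as python pick_best.py tune.log, then use the corresponding ctb6.0/ctb6.0.seg.model.$i on the test set.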




5) Testing
    ./dist/segmentor/segmentor ctb6.0/ctb6.0.seg.model.18 ctb6.0/ctb6.0.test.unseg.txt ctb6.0.test.autoseg.txt

python evaluate.py ctb6.0.test.autoseg.txt ctb6.0/ctb6.0.test.segmented.txt

Word precision: 0.951947892344
Word recall: 0.949520704111
Word F-measure: 0.950732749098
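
For reference, these word-level scores are the standard segmentation metrics: a word counts as correct only when both of its boundaries match the gold segmentation; precision = correct words / words in the automatic output, recall = correct words / words in the gold standard, and F-measure = 2PR/(P+R). The snippet below is not evaluate.py itself, just a Python 3 illustration of that computation:

#####prf_sketch.py#####
# word-level precision/recall/F-measure by matching character spans
# (an illustration of the metric, not the actual evaluate.py)
def spans(tokens):
    # turn a token list into a set of (start, end) character offsets
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def prf(auto_lines, gold_lines):
    correct = auto_total = gold_total = 0
    for auto, gold in zip(auto_lines, gold_lines):
        a, g = spans(auto.split()), spans(gold.split())
        correct += len(a & g)
        auto_total += len(a)
        gold_total += len(g)
    p, r = correct / auto_total, correct / gold_total
    return p, r, 2 * p * r / (p + r)

# example:
# p, r, f = prf(open("ctb6.0.test.autoseg.txt"), open("ctb6.0/ctb6.0.test.segmented.txt"))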



   
 
