Friday, March 29, 2013

Some cdec internals

1) Reading the cdec.ini file
line 484: decoder.cc
DecoderImpl::DecoderImpl() at decoder.cc:484
Decoder::Decoder() at decoder.cc:740
main() at cdec.cc:14

2) Reading the weight.init file
line 15: weights.cc
Weights::InitFromFile() at weights.cc:15
DecoderImpl::DecoderImpl() at decoder.cc:571

Decoder::Decoder() at decoder.cc:740
main() at cdec.cc:14

3) Reading the language model file
line 92: ff_factory.h
FactoryRegistry::Create() at ff_factory.h:92
make_ff() at decoder.cc:131
DecoderImpl::DecoderImpl() at decoder.cc:604
Decoder::Decoder() at decoder.cc:740
main() at cdec.cc:14

4) Adding the glue grammar
line 51: scfg_translator.cc
GlueGrammar::GlueGrammar() at scfg_translator.cc:51
SCFGTranslatorImpl::SCFGTranslatorImpl() at scfg_translator.cc:134
SCFGTranslator::SCFGTranslator() at scfg_translator.cc:345
DecoderImpl::DecoderImpl() at decoder.cc:667
Decoder::Decoder() at decoder.cc:740
main() at cdec.cc:14


5) Reading the grammar file
line 303: rule_lexer.ll
RuleLexer::ReadRules() at rule_lexer.ll:303
TextGrammar::ReadFromStream() at grammar.cc:123
TextGrammar::ReadFromFile() at grammar.cc:119
TextGrammar::TextGrammar() at grammar.cc:80
SCFGTranslator::ProcessMarkupHintsImpl() at scfg_translator.cc:363
Translator::ProcessMarkupHints() at translator.cc:21
DecoderImpl::Decode() at decoder.cc:791
Decoder::Decode() at decoder.cc:746
main() at cdec.cc:30

6) Converting the input sentence into a lattice structure
line 58: lattice.cc
LatticeTools::ConvertTextOrPLF() at lattice.cc:58
SCFGTranslatorImpl::Translate() at scfg_translator.cc:186
SCFGTranslator::TranslateImpl() at scfg_translator.cc:354
Translator::Translate() at translator.cc:33
DecoderImpl::Decode() at decoder.cc:794
Decoder::Decode() at decoder.cc:746
main() at cdec.cc:30
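The name ConvertTextOrPLF suggests the input here can be either plain text or a lattice in PLF (Python Lattice Format, as used by Moses-style lattice input). As a rough illustration only (the words and probabilities are made up, and I am assuming the usual PLF conventions: each position is a tuple of arcs, each arc a (label, probability, distance-to-next-node) triple):

```
((('ein',0.5,1),('dein',0.5,1)),(('haus',1.0,1),),)
```

Here the first lattice position has two alternative words and the second a single word; a plain-text sentence corresponds to the degenerate case of one arc per position.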

7) Adding the pass-through grammar
line 65: scfg_translator.cc
PassThroughGrammar::PassThroughGrammar() at scfg_translator.cc:65
SCFGTranslatorImpl::Translate() at scfg_translator.cc:190
SCFGTranslator::TranslateImpl() at scfg_translator.cc:354
Translator::Translate() at translator.cc:33
DecoderImpl::Decode() at decoder.cc:794
Decoder::Decode() at decoder.cc:746
main() at cdec.cc:30



Removing the phrase length limit in cdec

cdec has three parameters that bound the maximum number of words in a phrase span.

1) During rule extraction, the maximum length of a phrase in the training sentences. The variable is defined in sa-extract/sa-compile.pl as my $max_size = 15; and can be changed by running sa-extract/sa-compile.pl -p max-size=10.

2) Also during rule extraction, the maximum length of a phrase in the test sentences: if a phrase in a test sentence is longer than this value, no matching rules are looked up in the training sentences. This value is set in the sa-extract/extract.ini file (as a parameter to rulefactory.HieroCachingRuleFactory()):
max_initial_size=15,             # maximum span of a grammar rule in TEST DATA

This can be changed to max_initial_size=max_size, i.e. the maximum training-sentence phrase length from 1). Although max_size is also set in sa-extract/extract.ini, that value is overwritten by the one in sa-extract/sa-compile.pl.

3) During decoding, the maximum SCFG span, set in the configuration file cdec.ini, e.g. scfg_max_span_limit=12.

To lift the limit on the number of words in a phrase span in cdec, all three parameters must be changed, for example:

1) When calling sa-compile.pl, set -p: sa-extract/sa-compile.pl -p max-size=100000
2) In sa-extract/extract.ini, change max_initial_size=15 to max_initial_size=max_size
    or
    in the extract.ini file generated by sa-compile.pl, change max_initial_size=15 to max_initial_size=max_size
    (note: sa-extract/extract.ini is a template; the extract.ini that sa-compile.pl outputs is based on it)
3) In cdec.ini, change scfg_max_span_limit=12 to scfg_max_span_limit=100000
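Edits 2) and 3) above can be scripted with sed. A minimal sketch; the single-line files created here are stand-ins for the real extract.ini and cdec.ini, which contain many more settings:

```shell
# Stand-in config fragments; the real extract.ini / cdec.ini have more lines.
printf 'max_initial_size=15,             # maximum span of a grammar rule in TEST DATA\n' > extract.ini
printf 'scfg_max_span_limit=12\n' > cdec.ini

# 2) let the test-side limit track the training-side max_size
sed -i.bak 's/max_initial_size=15/max_initial_size=max_size/' extract.ini
# 3) effectively remove the decoding span limit
sed -i.bak 's/scfg_max_span_limit=12/scfg_max_span_limit=100000/' cdec.ini

cat extract.ini cdec.ini
```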

Monday, March 4, 2013

Using zpar for word segmentation

1) Download zpar from http://faculty.sutd.edu.sg/~yue_zhang/doc/index.html
2) Read the Chinese word segmentation documentation at http://faculty.sutd.edu.sg/~yue_zhang/doc/doc/segmenter.html; it is written very clearly
3) Build with make segmentor (this generates dist/segmentor/***)
4) Training
     ./dist/segmentor/train <train-file> <model-file> <number of iterations>
     ./dist/segmentor/train ctb6.0/ctb6.0.train.segmented.txt ctb6.0/ctb6.0.seg.model 1
    (the last argument is the number of iterations; you can use test.sh to run many iterations and pick the best one, see section 6, How to tune the performance of a system, in the Chinese word segmentation documentation)
    (download evaluate.py from http://faculty.sutd.edu.sg/~yue_zhang/doc/doc/seg_files/evaluate.py)
    (some file names and paths in test.sh need adjusting; each iteration continues from the model produced by the previous one, which is why test.sh always runs $segmentor/train train.txt model 1 with a single iteration)

#####test.sh#####
segmentor=dist/segmentor

rm model

for i in `seq 1 30`;
do
    echo "iteration $i"
    $segmentor/train ctb6.0/ctb6.0.train.segmented.txt model 1
    $segmentor/segmentor model ctb6.0/ctb6.0.deve.unseg.txt ctb6.0.deve.autoseg.txt
    python evaluate.py ctb6.0.deve.autoseg.txt ctb6.0/ctb6.0.deve.segmented.txt
    cp model ctb6.0/ctb6.0.seg.model.$i
done
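To pick the best iteration afterwards, the F-measure lines printed by evaluate.py can be paired with the iteration numbers echoed by the loop. A sketch, assuming the loop's output was saved to a log (the tune.log contents below are a made-up stand-in, reduced to just the two kinds of lines needed):

```shell
# Made-up stand-in for the loop's output, keeping only the echoed
# "iteration N" lines and evaluate.py's "Word F-measure" lines.
cat > tune.log <<'EOF'
iteration 1
Word F-measure: 0.941
iteration 2
Word F-measure: 0.953
iteration 3
Word F-measure: 0.949
EOF

# pair each "iteration N" line with the F-measure line that follows it,
# then sort numerically on the F-measure and keep the best
paste - - < tune.log | sort -k5,5nr | head -1
```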
Log from iteration 18:

iteration 18

Training started ...
Loading model ... done (8.39196s).
Computing averaged feature scores ... done
Saving model ... done.
Done.
Training has finished successfully. Total time taken is: 49.2729
Segmenting started
Loading model ... done (8.06785s).
Segmenting has finished successfully. Total time taken is: 11.2431
Word precision: 0.955604476243
Word recall: 0.955700108415
Word F-measure: 0.955652289936




5) Testing
    ./dist/segmentor/segmentor ctb6.0/ctb6.0.seg.model.18 ctb6.0/ctb6.0.test.unseg.txt ctb6.0.test.autoseg.txt

python evaluate.py ctb6.0.test.autoseg.txt ctb6.0/ctb6.0.test.segmented.txt

Word precision: 0.951947892344
Word recall: 0.949520704111
Word F-measure: 0.950732749098



   
 

Using the Stanford segmenter

1) Download from http://nlp.stanford.edu/software/segmenter.shtml; the current latest version is
1.6.7 (2012-11-11). It is 200M+, of which the Chinese training data ctb.gz and pku.gz take up roughly 200M.
Unpack stanford-segmenter-1.6.7-sources.jar with jar -xf and put the extracted folder into a src folder.
Build with ant to get a classes/ folder.

(jar xf stanford-segmenter-1.6.7-sources.jar
 mkdir src
 mv edu src/edu
 ant
 jar cvf seg.jar -C classes/ .)
This produces a new seg.jar file.


2) Training
  Extracted data from ctb6.0 (train/test/dev split following Xue, CL 08; train/test/dev: 23420/2796/2079 sentences).
  In ctb_v6/data/utf8/bracketed/chtb_3095.fid, line 89's ((WHNP-4 was changed to (CP (WHNP-4, and the CP at the very start of line 91 was removed. The sentence is ((IP (NP (CP (CP (IP (NP (NN 地方)(NN 当局))(VP (ADVP (AD 正在))(VP (VV 组织))))(DEC 的)))(ADJP (JJ 紧急))(NP (NN 求援)))(PU ,)(ADVP (AD 但是))(PP (P 据)(NP (NN 报道)))(PU ,)(NP (CP (CP (IP (VP (QP (CD 1多))(VP (VA 深))))(DEC 的)))(NP (NN 大雪)))(VP (PP (P 给)(NP (NN 求援)(NN 工作)))(VP (VV 造成)(AS 了)(NP (CP (CP (IP (VP (ADVP (AD 极))(VP (VA 大))))(DEC 的)))(NP (NN 困难)))))(PU 。)))

Download http://nlp.stanford.edu/software/trainSegmenter-20080521.tar.gz
(about 100M; it includes the Chinese ctb6.0 data).
1) Prepare ctb6.train, ctb6.dev, and ctb6.test (they must follow the ctb6.0 format; the POS-tagged information is used)
2) In the Makefile, set
CTB6=/path/to/ctb6.0/
SEGMENTER=/path/to/stanford-chinese-segmenter-2012-11-11/
and change time java6 --> time java
3) make all
If step 3 fails, check logs/ctb6.err.
On a machine with 8 GB of RAM this takes about one hour.
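The Makefile edits in step 2) can also be done with sed. A sketch; the fragment below is a stand-in for the relevant Makefile lines, and /data/ctb6.0/ and /opt/stanford-chinese-segmenter-2012-11-11/ are placeholders for your actual directories:

```shell
# Stand-in for the relevant Makefile lines; the real Makefile has many more.
cat > Makefile.frag <<'EOF'
CTB6=/path/to/ctb6.0/
SEGMENTER=/path/to/stanford-chinese-segmenter-2012-11-11/
	time java6 -mx8g ...
EOF

# point the two variables at real paths and drop the java6 alias
sed -i.bak \
    -e 's|^CTB6=.*|CTB6=/data/ctb6.0/|' \
    -e 's|^SEGMENTER=.*|SEGMENTER=/opt/stanford-chinese-segmenter-2012-11-11/|' \
    -e 's/time java6/time java/' \
    Makefile.frag

cat Makefile.frag
```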

The Makefile shows that training uses -serDictionary $(SEGMENTER)/data/dict-chris6.ser.gz.
It is not clear what dict-chris6.ser.gz contains, but it does not seem to be derived from the training set I provided...
(Ignoring that, the trained model/ctb6.ser.gz could be moved into stanford-segmenter-2012-11-11/data/ to replace ctb6.gz; not yet tried.)