Monday, November 10, 2008

Using the Berkeley Parser

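The Berkeley Parser can be downloaded from http://code.google.com/p/berkeleyparser/. After downloading it, I ran the code under Eclipse and found that the parsing performance I obtained was slightly lower than the figures reported in the paper, by a little over 1%. The experimental procedure is as follows: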

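Part 1: Training the model
1). After unpacking the jar provided with the Berkeley Parser and adding the source code to Eclipse or JBuilder, you will find that it does not compile. To fix this, comment out line 9 of edu.berkeley.nlp.PCFGLA.HierarchicalFullyConnectedAdaptiveLexiconWithFeatures.java.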

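2). When training on ctb2.0 or ctb5.1, three places need attention:
I. The loadChinese method in edu.berkeley.nlp.PCFGLA.Corpus.java, which selects the files used as the Chinese training corpus.
II. edu.berkeley.nlp.PennTreebankReader.java reads trees from files in PTB format, so the Chinese treebank files need to be cleaned up: remove extra content such as the article IDs and sentence IDs (the "( (" at the start of each sentence must be kept), so that only the bracketed data remains in the files.
III. The TreeCollection method in edu.berkeley.nlp.PennTreebankReader.java (line 136) assumes by default that file names end in .mrg; this needs to be changed to match the .fid extension used by CTB.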
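
The cleanup described in item II can be scripted. Below is a minimal Python sketch under a stated assumption: the extra content in each .fid file sits on SGML-style tag lines (lines beginning with "<", for example a document tag or a sentence ID tag), and every other non-blank line is bracketed tree text. The function name strip_markup and the sample input are hypothetical and are not taken from the treebank or from the parser.

```python
def strip_markup(text):
    """Keep only bracketed tree lines; drop tag lines and blank lines.

    Assumption: non-tree content sits on lines that start with '<'.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("<"):
            continue  # skip blank lines and tag lines such as <S ID=1>
        kept.append(line.rstrip())
    return "\n".join(kept) + "\n" if kept else ""


# Hypothetical sample in the assumed layout (not copied from CTB).
sample = """<DOC>
<S ID=1>
( (IP (NP (NN a))
      (VP (VV b))) )
</S>
</DOC>
"""

cleaned = strip_markup(sample)
print(cleaned, end="")
```

When applying this to real files, open them with the encoding of your CTB release, and check a few cleaned files by eye, since the actual markup may differ from the layout assumed here.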

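3). Start training. In Eclipse, set the arguments of edu.berkeley.nlp.PCFGLA.GrammarTrainer to -path ../Corpus/xma/ctb2.0/data2 -out ./ctb2.0/ctb2.0_sm5.gr -SMcycles 5 -treebank CHINESE
The ctb2.0 training/development/test sets contain 3485/353/348 sentences respectively. With SMcycles set to 5, training took 15 minutes; with SMcycles set to 6, it took 45 minutes. Each increase of SMcycles by 1 multiplies the training time (threefold in this case), so the time grows exponentially with the number of cycles.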

The ctb5.1 training/development/test sets contain 18103/352/348 sentences respectively. With SMcycles set to 5, training took 15 minutes.

Part 2: Testing the model
java -jar berkeleyParsem5.gr <> ctb2.0/ctb2.0_test.sm5.txt


Part 3: Evaluation
I used the standard evaluation program.
Results on ctb2.0 with SMcycles=5, using the standard training/development/test split:
-- All --
Number of sentence = 348
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 348
Bracketing Recall = 75.23
Bracketing Precision = 78.11
Bracketing FMeasure = 76.64
Complete match = 24.71
Average crossing = 2.61
No crossing = 45.69
2 or less crossing = 67.82
Tagging accuracy = 91.93

-- len<=40 --
Number of sentence = 300
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 300
Bracketing Recall = 77.59
Bracketing Precision = 80.65
Bracketing FMeasure = 79.09
Complete match = 28.67
Average crossing = 1.68
No crossing = 52.33
2 or less crossing = 75.33
Tagging accuracy = 92.38
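
As a sanity check on how these numbers relate, the bracketing F-measure is the harmonic mean of bracketing precision and recall. The short Python sketch below recomputes it from the rounded precision and recall printed above; the function name f_measure is my own, and because the evaluator works from unrounded counts, a recomputation from rounded values could in general differ in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the SMcycles=5 ctb2.0 results above.
f_all = f_measure(78.11, 75.23)
f_len40 = f_measure(80.65, 77.59)
print(round(f_all, 2), round(f_len40, 2))
```

For this block the recomputed values agree with the reported 76.64 and 79.09.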

Results on ctb2.0 with SMcycles=6, using the standard training/development/test split:
-- All --
Number of sentence = 348
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 348
Bracketing Recall = 74.79
Bracketing Precision = 77.02
Bracketing FMeasure = 75.89
Complete match = 24.43
Average crossing = 2.78
No crossing = 44.83
2 or less crossing = 64.94
Tagging accuracy = 92.17

-- len<=40 --
Number of sentence = 300
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 300
Bracketing Recall = 77.55
Bracketing Precision = 80.15
Bracketing FMeasure = 78.83
Complete match = 28.00
Average crossing = 1.80
No crossing = 51.67
2 or less crossing = 73.67
Tagging accuracy = 92.70



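In the NAACL07 paper "Improved Inference for Unlexicalized Parsing", the authors report the following performance on ctb2.0: <=40 words (LP 80.8, LR 80.7), all (LP 78.8, LR 78.5). However, their training/development/test sets are chtb026.fid-chtb270.fid / chtb001.fid-chtb025.fid / chtb271.fid-chtb300.fid. Under this split, the training/development/test sets contain 3178/307/348 sentences respectively.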
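
One way to reproduce such a split is to map each file name to its set by the number in the name. The following is a minimal Python sketch, assuming file names of the form chtbNNN.fid as written above; the names SPLIT_RANGES and split_for are hypothetical helpers, not part of the Berkeley Parser, whose own file selection lives in the loadChinese method.

```python
import re

# File-number ranges (inclusive) for the split described above.
SPLIT_RANGES = {
    "train": (26, 270),
    "dev": (1, 25),
    "test": (271, 300),
}


def split_for(filename):
    """Return 'train', 'dev', 'test', or None for a name like chtb026.fid."""
    match = re.fullmatch(r"chtb(\d+)\.fid", filename)
    if match is None:
        return None  # not a treebank file name in the assumed form
    number = int(match.group(1))
    for name, (low, high) in SPLIT_RANGES.items():
        if low <= number <= high:
            return name
    return None  # file number falls outside all three ranges


for name in ["chtb001.fid", "chtb026.fid", "chtb271.fid", "chtb301.fid"]:
    print(name, split_for(name))
```

Files can then be copied into separate directories according to the returned label before training and testing.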
With this split and SMcycles set to 5, the results are:
-- All --
Number of sentence = 348
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 348
Bracketing Recall = 75.65
Bracketing Precision = 78.91
Bracketing FMeasure = 77.25
Complete match = 23.85
Average crossing = 2.49
No crossing = 45.11
2 or less crossing = 70.11
Tagging accuracy = 92.54

-- len<=40 --
Number of sentence = 300
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 300
Bracketing Recall = 78.23
Bracketing Precision = 81.75
Bracketing FMeasure = 79.95
Complete match = 27.67
Average crossing = 1.53
No crossing = 51.67
2 or less crossing = 79.00
Tagging accuracy = 93.10
With this split and SMcycles set to 6, the results are:
-- All --
Number of sentence = 348
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 348
Bracketing Recall = 75.30
Bracketing Precision = 77.29
Bracketing FMeasure = 76.28
Complete match = 25.00
Average crossing = 2.62
No crossing = 44.25
2 or less crossing = 68.97
Tagging accuracy = 91.84

-- len<=40 --
Number of sentence = 300
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 300
Bracketing Recall = 78.53
Bracketing Precision = 80.74
Bracketing FMeasure = 79.62
Complete match = 28.67
Average crossing = 1.59
No crossing = 51.00
2 or less crossing = 78.00
Tagging accuracy = 92.51




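In the NAACL07 paper "Improved Inference for Unlexicalized Parsing", the authors report the following performance on ctb2.0: <=40 words (LP 86.9, LR 85.7), all (LP 84.8, LR 81.9).
With CTB5.1, the training set consists of chtb001.fid-chtb270.fid + chtb400.fid-chtb1021.fid + chtb1030.fid-chtb1151.fid. (When the training set includes chtb1022.fid or chtb1129.fid, the parser fails to find a parse tree for many test sentences and outputs (()) instead.)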
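
To notice this kind of failure quickly, the parser output can be scanned for empty trees before running the evaluation. Here is a minimal Python sketch, assuming the output holds one tree per line and that a failed parse appears as (()), possibly with spaces inside; the function name count_failed_parses and the sample output are hypothetical.

```python
def count_failed_parses(lines):
    """Count output lines that are the empty tree (()), ignoring whitespace."""
    failed = 0
    total = 0
    for line in lines:
        compact = "".join(line.split())  # remove all whitespace
        if not compact:
            continue  # ignore blank lines
        total += 1
        if compact == "(())":
            failed += 1
    return failed, total


# Hypothetical parser output: three trees, two of them empty.
output = [
    "( (IP (NP (NN a)) (VP (VV b))) )",
    "(())",
    "( () )",
    "",
]
failed, total = count_failed_parses(output)
print(f"{failed} of {total} sentences have no parse")
```

A sudden jump in the failed count after adding a file to the training set points to that file as the likely cause.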
Results on ctb5.1 with SMcycles=5:
-- All --
Number of sentence = 348
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 348
Bracketing Recall = 81.01
Bracketing Precision = 83.63
Bracketing FMeasure = 82.30
Complete match = 30.46
Average crossing = 1.99
No crossing = 53.74
2 or less crossing = 72.41
Tagging accuracy = 95.08

-- len<=40 --
Number of sentence = 299
Number of Error sentence = 0
Number of Skip sentence = 0
Number of Valid sentence = 299
Bracketing Precision = 86.65
Bracketing FMeasure = 85.38
Complete match = 35.45
Average crossing = 1.16
No crossing = 60.87
2 or less crossing = 80.94
Tagging accuracy = 95.58
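
The len<=40 block above lists bracketing precision and F-measure but has no recall line. Since the F-measure is the harmonic mean of precision and recall, the recall can be estimated by solving that formula for R. The Python sketch below does this; the function name recall_from is my own, and the result is an estimate derived from rounded figures, not a value reported by the evaluator.

```python
def recall_from(precision, f_measure):
    """Solve F = 2PR / (P + R) for R, giving R = F*P / (2P - F)."""
    denominator = 2 * precision - f_measure
    if denominator <= 0:
        raise ValueError("F-measure is not achievable at this precision")
    return f_measure * precision / denominator


# Precision and F-measure from the len<=40 ctb5.1 results above.
recall = recall_from(86.65, 85.38)
print(f"{recall:.1f}")
```

This gives a recall of roughly 84.1; rounding in the printed precision and F-measure leaves the second decimal place uncertain.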

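1 comment:

yaqi said...

Hello, may I ask what format the training data you used to train the model in your Berkeley Parser experiments was in? Could you paste a short sample? Also, was the testing data used for parsing only segmented, or was it POS-tagged as well? Thanks!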