Wednesday, January 29, 2014

Running Thrax in Joshua on a Hadoop 2.0 Installation

The thrax.jar shipped with Joshua 5.0 ($JOSHUA/thrax/bin/thrax.jar) was compiled against a pre-2.0 version of Hadoop. Running it directly on Hadoop 2.0 fails with an error like the following:
----------------------------------------------------------------------------------------
2014-01-29 10:52:12,107 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
    at edu.jhu.thrax.hadoop.features.WordLexicalProbabilityCalculator$Map.map(WordLexicalProbabilityCalculator.java:56)
    at edu.jhu.thrax.hadoop.features.WordLexicalProbabilityCalculator$Map.map(WordLexicalProbabilityCalculator.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)

2014-01-29 10:52:12,213 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2014-01-29 10:52:12,214 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2014-01-29 10:52:12,214 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
----------------------------------------------------------------------------------------   
 
 
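The error arises because org.apache.hadoop.mapreduce.Counter is a class in Hadoop 1.x but an interface in Hadoop 2.0, so a jar compiled against the old API is not binary compatible with the new one. The fix is to download the Thrax source again, compile it against Hadoop 2.0, and replace the original thrax.jar with the newly built thrax.jar.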
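
As a rough sketch of the final replacement step, the shell function below backs up the jar that ships with Joshua and copies the rebuilt one over it. The function name install_thrax_jar is made up for this post, and the location of the rebuilt jar (bin/thrax.jar inside the thrax checkout) is an assumption; adjust it to wherever your build writes its output.

```shell
# Hypothetical helper: replace an existing jar with a rebuilt one,
# keeping a backup of the original next to it.
# Usage: install_thrax_jar NEW_JAR OLD_JAR
install_thrax_jar() {
    new_jar=$1
    old_jar=$2

    # Refuse to continue if the rebuilt jar is missing.
    if [ ! -f "$new_jar" ]; then
        echo "rebuilt jar not found: $new_jar" >&2
        return 1
    fi

    # Keep the original as OLD_JAR.orig (only the first time,
    # so a second run does not overwrite the real original).
    if [ -f "$old_jar" ] && [ ! -f "$old_jar.orig" ]; then
        cp "$old_jar" "$old_jar.orig" || return 1
    fi

    cp "$new_jar" "$old_jar"
}

# Example call (both paths are assumptions):
# install_thrax_jar thrax/bin/thrax.jar "$JOSHUA/thrax/bin/thrax.jar"
```

Keeping the .orig copy makes it easy to go back to the stock jar if the rebuilt one misbehaves.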

(See https://github.com/jweese/thrax/wiki/Quickstart)
1) Download Thrax
git clone https://github.com/jweese/thrax.git
 
2) Compile
ant
 

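build.xml needs to be modified; the changes are listed below. After these changes no environment variables need to be set, but JAVA_HOME must point to the correct JDK (here /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64).

Provide the correct jar files.

Remove the checks for whether the HADOOP and HADOOP_VERSION environment variables are set (these two variables exist mainly so that the right jar files can be found).

Remove the dependency on init-amazon, so that the AWS_SDK environment variable does not need to be set either.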
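
To see which jars a Hadoop 2.0 installation offers before deciding what build.xml should reference, something like the following may help. It is only a sketch: the function name list_hadoop_jars is made up here, HADOOP_PREFIX in the example call is an assumed variable pointing at the installation directory, and the share/hadoop layout is that of the stock Apache Hadoop 2 tarball, which a distribution package may not follow.

```shell
# Hypothetical helper: print every hadoop-*.jar below a directory, sorted.
# Usage: list_hadoop_jars DIRECTORY
list_hadoop_jars() {
    find "$1" -type f -name 'hadoop-*.jar' | sort
}

# Example call (HADOOP_PREFIX is an assumption):
# list_hadoop_jars "$HADOOP_PREFIX/share/hadoop"
```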

3) Fixing the Amazon-related errors that ant reports
Delete the file src/edu/jhu/thrax/util/amazon/AmazonConfigFileLoader.java
Edit the file src/edu/jhu/thrax/util/ConfFileParser.java
(Remove the line import edu.jhu.thrax.util.amazon.AmazonConfigFileLoader;
 and change scanner = new Scanner(AmazonConfigFileLoader.getConfigStream(configURI)); to scanner = new Scanner(DefaultConfigFileLoader.getConfigStream(configURI));)
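
The two source edits in step 3 can also be scripted. The function below is a minimal sketch, assuming it is given the root of the thrax checkout and that the import and the getConfigStream call appear in ConfFileParser.java exactly as quoted above; the name strip_amazon_loader is hypothetical.

```shell
# Hypothetical helper: apply the step 3 source changes to a thrax checkout.
# Usage: strip_amazon_loader THRAX_SOURCE_ROOT
strip_amazon_loader() {
    src=$1/src/edu/jhu/thrax/util

    # Remove the Amazon-specific config loader.
    rm -f "$src/amazon/AmazonConfigFileLoader.java"

    # Delete the import line and switch the Scanner over to
    # DefaultConfigFileLoader. -i.bak leaves ConfFileParser.java.bak
    # behind and is accepted by both GNU and BSD sed.
    sed -i.bak \
        -e '/^[[:space:]]*import edu\.jhu\.thrax\.util\.amazon\.AmazonConfigFileLoader;/d' \
        -e 's/AmazonConfigFileLoader\.getConfigStream/DefaultConfigFileLoader.getConfigStream/' \
        "$src/ConfFileParser.java"
}

# Example call, run from the directory that contains the checkout:
# strip_amazon_loader thrax
```

Afterwards, compare ConfFileParser.java against the .bak copy to confirm that only those two lines changed before running ant again.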