
Using a custom InputFormat and OutputFormat with Hadoop Streaming

Date: 2019-11-04

Hadoop Streaming has options for specifying the input and output format classes:

-inputformat  TextInputFormat (default) | SequenceFileAsTextInputFormat | JavaClassName   Optional.
-outputformat TextOutputFormat (default) | JavaClassName   Optional.

However, since version 0.14 Hadoop no longer supports shipping multiple jar files with a job, so to use your own InputFormat or OutputFormat you have to add the corresponding class files to hadoop-streaming-1.0.1.jar, for example:

jar uf ../../contrib/streaming/hadoop-streaming-1.0.1.jar org/apache/hadoop/streaming/*.class
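The class files above come from compiling the sources against the Hadoop jars first; a rough sketch, where the hadoop-core jar name and the relative paths are assumptions based on a Hadoop 1.0.1 layout:

# compile the two classes (package org.apache.hadoop.streaming) into the current directory,
# so org/apache/hadoop/streaming/*.class exists for the jar uf step above
javac -classpath ../../hadoop-core-1.0.1.jar:../../contrib/streaming/hadoop-streaming-1.0.1.jar \
      -d . ContentRecordReder.java ContentInputFormat.java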

After that, the class name can be passed directly to -inputformat.
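A minimal sketch of how the job might then be submitted, assuming the classes have been packed into the streaming jar as above; the input/output paths and the /bin/cat mapper and reducer are placeholders, and ContentInputFormat is the class defined below:

hadoop jar ../../contrib/streaming/hadoop-streaming-1.0.1.jar \
    -input /user/hadoop/docs \
    -output /user/hadoop/out \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -inputformat org.apache.hadoop.streaming.ContentInputFormat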

The example below shows how to make each Map input record use the file name as the key and the entire content of the document as the value:

1. Define your own InputFormat:

ContentRecordReder.java

package org.apache.hadoop.streaming;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class ContentRecordReder implements RecordReader<Text, Text> {
    private static final Log LOG = LogFactory.getLog(ContentRecordReder.class.getName());
    private CompressionCodecFactory compressionCodecs = null;
    private long start;
    private long pos;
    private long end;
    private byte[] buffer;
    private String keyName;
    private FSDataInputStream fileIn;

    public ContentRecordReder(Configuration job, FileSplit split) throws IOException {
        start = split.getStart(); // each file is handled as a single split
        end = split.getLength() + start;
        final Path path = split.getPath();
        keyName = path.toString();
        LOG.info("filename in hdfs is : " + keyName);
        System.out.println("filename in hdfs is : " + keyName);
        final FileSystem fs = path.getFileSystem(job);
        fileIn = fs.open(path);
        fileIn.seek(start);
        buffer = new byte[(int)(end - start)]; // buffer holds the whole split, i.e. the whole file
        this.pos = start;
    }

    public Text createKey() {
        return new Text();
    }

    public Text createValue() {
        return new Text();
    }

    public long getPos() throws IOException {
        return pos;
    }

    public float getProgress() {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float)(end - start));
        }
    }

    // Emits a single record per split: key = file name, value = entire file content.
    public boolean next(Text key, Text value) throws IOException {
        while (pos < end) {
            key.set(keyName);
            value.clear();
            fileIn.readFully(pos, buffer);
            value.set(buffer);
            LOG.info("--- content: " + value.toString());
            System.out.println("--- content: " + value.toString());
            pos += buffer.length;
            LOG.info("end is : " + end + " pos is : " + pos);
            return true;
        }
        return false;
    }

    public void close() throws IOException {
        if (fileIn != null) {
            fileIn.close();
        }
    }
}

ContentInputFormat.java
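ContentInputFormat wires the reader in. A minimal sketch, assuming the old org.apache.hadoop.mapred API that ContentRecordReder uses (not necessarily the article's exact listing): it marks files as non-splittable so each document stays in one split, and getRecordReader returns the reader defined above.

package org.apache.hadoop.streaming;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ContentInputFormat extends FileInputFormat<Text, Text> {

    // Keep each file in a single split so the reader can emit the whole document as one value.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
                                                    Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ContentRecordReder(job, (FileSplit) split);
    }
}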
