Accessing the MaxMind Geo API in Hadoop using the distributed cache

Posted by ncgqoxb0 on 2021-06-04 in Hadoop


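I am writing a MapReduce job to analyze web logs. My code is meant to map IP addresses to geographic locations, and I am using the MaxMind Geo API (https://github.com/maxmind/geoip-api-java) for this purpose. My code uses LookupService, which needs the database file containing the IP-to-location mappings. I am trying to pass this database file using the distributed cache. I tried doing this in two different ways.
Case 1:
Run the job passing the file from HDFS, but it always throws an error saying "File not found".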

sudo -u hdfs hadoop jar \
 WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
/user/hdfs/GeoLiteCity.dat

or

sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat

Driver class code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.addCacheFile(new Path(args[2]).toUri());

Mapper class code:

public void setup(Context context) throws IOException
{
  URI[] uriList = context.getCacheFiles();
  Path database_path = new Path(uriList[0].toString());
  LookupService cl = new LookupService(database_path.toString(),
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

Case 2: Run the code by passing the file from the local file system with the -files option. Error: a NullPointerException at the line LookupService cl = new LookupService(database_path)

sudo -u hdfs hadoop jar  \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
com.prithvi.mapreduce.logprocessing.ipgeo.GeoLocationDatasetDriver \
-files /tmp/jobs/GeoLiteCity.dat /user/hdfs/input /user/hdfs/out_put \
GeoLiteCity.dat

Driver code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
String dbfile = args[2];
conf.set("maxmind.geo.database.file", dbfile);

Mapper code:

public void setup(Context context) throws IOException
{
  Configuration conf = context.getConfiguration();
  String database_path = conf.get("maxmind.geo.database.file");
  LookupService cl = new LookupService(database_path,
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

I need this database file to be available on all my task trackers for the job to complete. Can anyone suggest the right way to do this?

hadoop mapreduce distributed-cache geoip

Source: https://stackoverflow.com/questions/25193145/accessing-maxmind-geo-api-in-hadoop-using-distributed-cache

1 answer


kse8i1jr1#

Try doing it this way:
From the driver, use the Job object:

job.addCacheFile(new URI("hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));

where # denotes the alias (symbolic link) that Hadoop will create.
After that, you can access the file in the setup() method:

@Override
protected void setup(Context context) {
  File file = new File("GeoLite2-City.mmdb");
}
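
To make the role of the # fragment concrete, here is a minimal sketch using only java.net.URI (no Hadoop classes). The class CacheAliasDemo and the method linkName are hypothetical names made up for illustration. The fallback to the file's base name when there is no fragment reflects my understanding of how Hadoop 2 names the link; treat it as an assumption and verify it on your cluster.

```java
import java.net.URI;

// Hypothetical helper, for illustration only: it is not part of Hadoop or MaxMind.
final class CacheAliasDemo {

    /** Returns the local name a task would be expected to open for a cache URI. */
    static String linkName(URI cacheUri) {
        // The part after '#' is the alias requested for the symbolic link.
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // Assumption: without a fragment, the link is named after the file itself.
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        URI withAlias = URI.create("hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb");
        URI withoutAlias = URI.create("hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat");
        System.out.println(linkName(withAlias));
        System.out.println(linkName(withoutAlias));
    }
}
```

In the mapper, the name returned here is what you would pass to new File(...), rather than the full hdfs:// URI string.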

For example:
Driver code: http://goo.gl/coqysa
Mapper code: http://goo.gl/0sbqqp
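
Before handing the file to LookupService in setup(), it may help to fail fast with a clear message instead of a bare "File not found" or NullPointerException. The sketch below is plain java.io; LocalCacheFileCheck and problemWith are hypothetical names, not part of Hadoop or the MaxMind API. It also suggests why Case 1 in the question likely fails: an hdfs:// string is not a path in the local file system, and LookupService appears to open its argument as a local file (an assumption on my part).

```java
import java.io.File;

// Hypothetical helper, for illustration only.
final class LocalCacheFileCheck {

    /** Returns null when the file looks usable, otherwise a human-readable reason. */
    static String problemWith(File file) {
        if (!file.exists()) {
            return file.getPath() + " does not exist in the local file system";
        }
        if (!file.isFile()) {
            return file.getPath() + " is not a regular file";
        }
        if (!file.canRead()) {
            return file.getPath() + " is not readable by the task user";
        }
        return null;
    }

    public static void main(String[] args) {
        // The Case 1 mapper passed a string like this straight to a local-file API.
        System.out.println(problemWith(
                new File("hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat")));
        // Inside a task, the symlink name from the cache URI would be checked instead.
        System.out.println(problemWith(new File("GeoLite2-City.mmdb")));
    }
}
```

In setup() you could call problemWith(new File("GeoLite2-City.mmdb")) and throw an IOException carrying the returned message when it is not null.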

Answered 2021-06-04
