I am trying to implement the Hortonworks data pipeline example on an actual cluster. I have installed HDP 2.2 on the cluster, but the Processes and Datasets tabs in the UI show the following error:
Failed to load data. Error: 400 Bad Request
All of my services are running except HBase, Kafka, Knox, Ranger, Slider, and Spark.
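To narrow down the 400 Bad Request it can help to bypass the UI and query Falcon directly. A minimal sketch, assuming the Falcon server listens on its default port 15000 on node1.com.analytics (the host is an assumption; adjust to your setup):

# ping the Falcon server itself (default port 15000)
curl "http://node1.com.analytics:15000/api/admin/version?user.name=falcon"

# list the entities Falcon knows about; an error here usually mirrors the UI failure
falcon entity -type cluster -list
falcon entity -type feed -list
falcon entity -type process -list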
I have read the Falcon entity specification, which describes the individual tags of the cluster, feed, and process definitions, and modified the feed and process XML configuration files as shown below.
Cluster definition
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="Analytics1" colo="Bangalore" xmlns="uri:falcon:cluster:0.1">
<interfaces>
<interface type="readonly" endpoint="hftp://node3.com.analytics:50070" version="2.6.0"/>
<interface type="write" endpoint="hdfs://node3.com.analytics:8020" version="2.6.0"/>
<interface type="execute" endpoint="node1.com.analytics:8050" version="2.6.0"/>
<interface type="workflow" endpoint="http://node1.com.analytics:11000/oozie/" version="4.1.0"/>
<interface type="messaging" endpoint="tcp://node1.com.analytics:61616?daemon=true" version="5.1.6"/>
</interfaces>
<locations>
<location name="staging" path="/user/falcon/primaryCluster/staging"/>
<location name="working" path="/user/falcon/primaryCluster/working"/>
</locations>
<ACL owner="falcon" group="hadoop"/>
</cluster>
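For reference, a cluster entity like this is normally submitted with the standard Falcon CLI (the file name below is a placeholder):

falcon entity -type cluster -submit -file primaryCluster.xml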
Feed definitions
rawEmailFeed
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
<tags>externalSystem=USWestEmailServers,classification=secure</tags>
<groups>churnAnalysisDataPipeline</groups>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<late-arrival cut-off="hours(4)"/>
<clusters>
<cluster name="primaryCluster" type="source">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
<retention limit="days(3)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
<location type="stats" path="/none"/>
<location type="meta" path="/none"/>
</locations>
<ACL owner="falcon" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
cleansedEmailFeed
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
<tags>owner=USMarketing,classification=Secure,externalSource=USProdEmailServers,externalTarget=BITools</tags>
<groups>churnAnalysisDataPipeline</groups>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<clusters>
<cluster name="primaryCluster" type="source">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
<retention limit="days(10)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data" path="/user/falcon/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
</locations>
<ACL owner="falcon" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
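The two feed entities are submitted the same way (again, file names are placeholders):

falcon entity -type feed -submit -file rawEmailFeed.xml
falcon entity -type feed -submit -file cleansedEmailFeed.xml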
Process definitions
rawEmailIngestProcess
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
<clusters>
<cluster name="primaryCluster">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<outputs>
<output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
</outputs>
<workflow name="emailIngestWorkflow" version="2.0.0" engine="oozie" path="/user/falcon/apps/ingest/fs"/>
<retry policy="periodic" delay="minutes(15)" attempts="3"/>
<ACL owner="falcon" group="hadoop"/>
</process>
cleanseEmailProcess
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
<tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup</tags>
<clusters>
<cluster name="primaryCluster">
<validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
</cluster>
</clusters>
<parallel>1</parallel>
<order>FIFO</order>
<frequency>hours(1)</frequency>
<timezone>UTC</timezone>
<inputs>
<input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
</inputs>
<outputs>
<output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
</outputs>
<workflow name="emailCleanseWorkflow" version="5.0" engine="pig" path="/user/falcon/apps/pig/id.pig"/>
<retry policy="periodic" delay="minutes(15)" attempts="3"/>
<ACL owner="falcon" group="hadoop"/>
</process>
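The processes are then submitted and scheduled; scheduling is the step that actually creates the Oozie coordinators (file names are placeholders):

falcon entity -type process -submit -file rawEmailIngestProcess.xml
falcon entity -type process -submit -file cleanseEmailProcess.xml
falcon entity -type process -schedule -name rawEmailIngestProcess
falcon entity -type process -schedule -name cleanseEmailProcess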
I have not made any changes to ingest.sh, workflow.xml, or id.pig. They are present in HDFS under /user/falcon/apps/ingest/fs (ingest.sh and workflow.xml) and /user/falcon/apps/pig (id.pig). Also, I was not sure whether the hidden .DS_Store files were needed, so I did not include them in the HDFS locations mentioned above.
ingest.sh
#!/bin/bash
# $1 is the target HDFS directory for this feed instance (passed in by the Oozie shell action)
# curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put wiki-data/*.txt $1
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put enron_with_categories/*/*.txt $1
workflow.xml
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>ingest.sh</exec>
<argument>${feedInstancePaths}</argument>
<file>${wf:appPath()}/ingest.sh#ingest.sh</file>
<!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
<!-- <capture-output/> -->
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
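Whether the workflow app is laid out where the process definition expects it, and what Oozie actually launched, can be checked with something like this (the Oozie URL comes from the workflow interface in the cluster definition above):

hadoop fs -ls /user/falcon/apps/ingest/fs
oozie jobs -oozie http://node1.com.analytics:11000/oozie -len 10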
id.pig
-- load each input record as comma-separated fields
A = load '$input' using PigStorage(',');
-- keep only the first field of each record
B = foreach A generate $0 as id;
-- write the result to the output feed location
store B into '$output' USING PigStorage();
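As far as I understand, with engine="pig" Falcon wraps the script in a generated Oozie workflow and passes each input and output name from the process definition as a Pig parameter, so '$input' and '$output' resolve to feed instance paths. Conceptually the invocation is equivalent to something like this (paths are illustrative, derived from the feed location templates above):

pig -param input=/user/falcon/input/enron/2014-02-28-00 \
    -param output=/user/falcon/processed/enron/2014-02-28-00 \
    id.pig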
I am not quite clear on how the flow of this HDP example works, and I would be very grateful if someone could clarify it.
Specifically, I do not understand where the $1 argument given to ingest.sh comes from. I believe it is the HDFS location where the incoming data gets stored. I noticed that workflow.xml has the tag <argument>${feedInstancePaths}</argument>. Where does the value of feedInstancePaths come from? I think I am getting the error because the feed is not being given the correct location to store the data, but that may be a different issue.
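If I understand the docs correctly, Falcon computes feedInstancePaths when it materializes the process into an Oozie coordinator: it resolves the output feed's data location template against the instance time and injects the result as an Oozie workflow property, which the shell action then forwards to ingest.sh as $1. For the rawEmailFeed template above, an instance at 2014-02-28T00:00Z should resolve to something like this (illustrative):

# feed template: /user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}
# resolved for the output instance now(0,0) at 2014-02-28T00:00Z:
feedInstancePaths=/user/falcon/input/enron/2014-02-28-00
# so the shell action effectively runs:
sh ingest.sh /user/falcon/input/enron/2014-02-28-00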
The user falcon also has 755 permissions on all HDFS directories under /user/falcon.
Any help or suggestions would be much appreciated.
1 Answer
You are running your own cluster, but the tutorial expects the resources specified in the shell script (ingest.sh):
My guess is that your cluster cannot reach the address sandbox.hortonworks.com, and that you also do not have the required resource wiki-data.tar.gz. The tutorial only works on the provided sandbox.
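If you still want to run it on your own cluster, one possible workaround (an untested sketch) is to fetch the data yourself from a machine that does have internet access and stage it into the path the feed expects, mirroring what ingest.sh does; the instance path below is illustrative, following the rawEmailFeed location template:

curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz
hadoop fs -mkdir -p /user/falcon/input/enron/2014-02-28-00
hadoop fs -put enron_with_categories/*/*.txt /user/falcon/input/enron/2014-02-28-00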