我写了一个小程序
public class EMRSpark {
public static void main(String[] args) throws IOException {
String logFile = "s3n://aBucket/aFolder/*.json"; //EMR cluster has access to bucket
SparkConf conf = new SparkConf().setAppName("EMRSpark");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logData = sc.textFile(logFile).cache();
long numAs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("a"); }
}).count();
long numBs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("b"); }
}).count();
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
}
}
pom.xml文件
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.foo</groupId>
<artifactId>emrspark</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>emrspark</name>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>
</dependencies>
<build>
<finalName>emrspark</finalName>
<plugins>
<!-- download source code in Eclipse, best practice -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<version>2.9</version>
<configuration>
<downloadSources>true</downloadSources>
<downloadJavadocs>false</downloadJavadocs>
</configuration>
</plugin>
<!-- Set a compiler level -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- Maven Assembly Plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4.1</version>
<configuration>
<!-- get all project dependencies -->
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<!-- MainClass in mainfest make a executable jar -->
<archive>
<manifest>
<mainClass>com.foo.emrspark.EMRSpark</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<!-- bind to the packaging phase -->
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
我正在努力让它运行在电子病历。我试过(用瘦jar和胖jar)
$ aws emr put --cluster-id anId --key-pair-file bla.pem --src /home/fred-fri/git/emrspark/target/emrspark.jar
$ aws emr add-steps --cluster-id anId --steps Type=Spark,Name="EMRSpark",Args=[--class,com.foo.emrspark.EMRSpark,--deploy-mode,cluster,--master,yarn,/home/hadoop/emrspark.jar]
步骤已创建并运行,但失败。我也尝试过通过awswebgui来实现它们,结果是一样的。
如何让应用程序在emr中运行?
暂无答案!
目前还没有任何答案,快来回答吧!