Filter a dataset where a column is not a number using the Spark Java API 2.2?

Posted by 8ehkhllq on 2021-05-29 in Hadoop


I am new to the Spark Java API. I want to filter my dataset on whether a column is a number. My dataset ds1 looks like this:

+---------+------------+
|  account|    amount  |
+---------+------------+
| aaaaaa  |            |
| aaaaaa  |            |
| bbbbbb  |            |
| 123333  |            |
| 555555  |            |
| 666666  |            |
+---------+------------+

I want to get back a dataset ds2 that keeps only the rows with a numeric account, like this:

+---------+------------+
|  account|    amount  |
+---------+------------+
| 123333  |            |
| 555555  |            |
| 666666  |            |
+---------+------------+

I tried this, but it does not work for me:

ds2=ds1.select("account"). where(dsFec.col("account").isNaN());

Can someone point me to a sample Spark expression that solves this?

Java hadoop hdfs apache-spark

Source: https://stackoverflow.com/questions/50609219/filter-dataset-using-where-column-is-not-a-number-using-spark-java-api-2-2

3 Answers


eoxn13cs #1

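Just cast the column and check whether the result is null. In Spark 2.2, a string that cannot be parsed as a number casts to null, so the non-null rows are the numeric ones. Filtering ds1 directly, rather than after select("account"), keeps the amount column in the result:

ds2 = ds1.where(ds1.col("account").cast("bigint").isNotNull());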


sxissh06 #2

You can define a udf function that checks whether the string column account is numeric:

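import org.apache.commons.lang3.StringUtils; // or org.apache.commons.lang.StringUtils, depending on your classpath
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

UDF1<String, Boolean> checkNumeric = new UDF1<String, Boolean>() {
        @Override
        public Boolean call(final String account) throws Exception {
            // true only when every character of account is a digit
            return StringUtils.isNumeric(account);
        }
    };

    sqlContext.udf().register("numeric", checkNumeric, DataTypes.BooleanType);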

Then call the udf through the callUDF function (callUDF and col are static imports from org.apache.spark.sql.functions):

df.filter(callUDF("numeric", col("account"))).show();

This should give you:

+-------+------+
|account|amount|
+-------+------+
| 123333|      |
| 555555|      |
| 666666|      |
+-------+------+
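
To make clearer what such a predicate accepts, or for setups without Apache Commons Lang on the classpath, here is a minimal plain-Java sketch of a digits-only check. The class DigitCheckSketch and the method isDigitsOnly are hypothetical names introduced for illustration and are not part of Spark or Commons Lang. Treating null and empty strings as non-numeric, and rejecting signs, decimal points and surrounding whitespace, are assumptions of this sketch.

```java
// Hypothetical stand-in for the body of the UDF's call() method; not part of any library.
class DigitCheckSketch {

    /** Returns true only for non-null, non-empty strings made up entirely of digits. */
    static boolean isDigitsOnly(String value) {
        if (value == null || value.isEmpty()) {
            return false; // assumption: missing or empty values count as non-numeric
        }
        for (int i = 0; i < value.length(); i++) {
            if (!Character.isDigit(value.charAt(i))) {
                return false; // letters, signs, dots and spaces all fail here
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample values taken from the account column in the question.
        for (String account : new String[] {"aaaaaa", "bbbbbb", "123333", "555555"}) {
            System.out.println(account + " -> " + isDigitsOnly(account));
        }
    }
}
```

Inside the UDF, `return DigitCheckSketch.isDigitsOnly(account);` could stand in for the StringUtils call. If the account values carry padding spaces, they would need to be trimmed first.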


jm2pwxwz #3

One way to do this, shown here as the Scala equivalent:

import scala.util.Try
df.filter(r => Try(r.getString(0).toInt).isSuccess).show()

+-------+------+
|account|amount|
+-------+------+
| 123333|      |
| 555555|      |
| 666666|      |
+-------+------+

You can also do the same with a Java-style try-catch (still written in Scala here):

// needs import spark.implicits._ for the tuple encoder
df.map { r =>
    val isNumeric =
      try { r.getString(0).toInt; true }
      catch { case _: RuntimeException => false }
    (r.getString(0), r.getString(1), isNumeric)
  }.filter(_._3).drop("_3").show()

+------+---+
|    _1| _2|
+------+---+
|123333|   |
|555555|   |
|666666|   |
+------+---+
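
Since the question asks about Java and the snippets above are Scala, the try-catch idea can be sketched in plain Java as follows. This runs outside Spark on an ordinary list, purely to show the predicate. The class ParseCheckSketch and the methods parsesAsLong and keepNumeric are hypothetical names, and using Long.parseLong in place of Scala's toInt is an assumption made so that longer account numbers still parse. Note that Long.parseLong also accepts a leading sign.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper illustrating the try-catch check; not part of Spark.
class ParseCheckSketch {

    /** Returns true when value can be parsed as a long, false otherwise. */
    static boolean parsesAsLong(String value) {
        if (value == null) {
            return false; // assumption: null counts as non-numeric
        }
        try {
            Long.parseLong(value);
            return true;
        } catch (NumberFormatException e) {
            return false; // the parse failure is the "not a number" signal
        }
    }

    /** Keeps only the values that parse as a long, preserving their order. */
    static List<String> keepNumeric(List<String> values) {
        return values.stream()
                .filter(ParseCheckSketch::parsesAsLong)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Account values from the question's ds1.
        List<String> accounts =
                Arrays.asList("aaaaaa", "aaaaaa", "bbbbbb", "123333", "555555", "666666");
        System.out.println(keepNumeric(accounts)); // prints [123333, 555555, 666666]
    }
}
```

The same parsesAsLong method could serve as the body of a UDF1<String, Boolean>, or of a FilterFunction<Row> that reads the account column from each row.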
