R: How can I quickly find where NAs were introduced?

nhn9ugyo, posted on 2023-11-14 in Other

I have two data frames containing groundwater data:

df1 <- data.frame('ID' = c('r23', 'r24', 'r25', 'r26'),
                 'depth' = c("deep","deep","shallow","shallow"),
                 'conc1' = c("3.0", "<1.0", "2.5", "10.0"),
                 'conc2' = c(23, 45, 56, 12)
)
df2 <- data.frame('ID' = c('r27', 'r28', 'r29', 'r30'),
                 'depth2' = c("deep","shallow","shallow","deep"),
                 'conc1' = c(5.0, 3.4, 5.2, 1.2),
                 'conc2' = c(56, 76, 45, 23)
)

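I identify the numeric variables like this, so that numeric variables containing detection-limit values (such as "<1.0") are captured as well: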

num_vars_p <- names(df1 %>% select_if(function(x) is.numeric(x) | sum(grepl("<[0-9]", x)) > 0))


Then I join them like this:

pm <- full_join(df1 %>% mutate(across(everything(), as.character)), 
            df2 %>% mutate(across(everything(), as.character)), 
            by = c("ID", "depth" = "depth2", "conc1", "conc2"),
            .keep_all = T) %>% mutate(across(all_of(num_vars_p), as.numeric))


Then I get this warning message:

Warning message:
Problem while computing `..1 = across(all_of(num_vars_p), as.numeric)`.
ℹ NAs introduced by coercion


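I would like a function that gives me the names of the columns in pm where NAs were introduced, and how many NAs were introduced.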

Answer #1 (ykejflvf)

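Let's resolve two problems first:
1. We don't have r23 and friends; for the sake of your code, I'll naively replace them with paste0("A", 1:4).
2. I'm removing .keep_all=TRUE from full_join. It is not an argument of the *_join functions, and in recent versions of dplyr it is an error to have anything unexpected in the ... portion of the call. I suspect this is a carry-over from a distinct(..) call, or perhaps you intended keep=TRUE (which produces another error, so perhaps not that).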

library(dplyr)

df1 <- structure(list(ID = c("A1", "A2", "A3", "A4"), depth = c("deep", "deep", "shallow", "shallow"), conc1 = c("3.0", "<1.0", "2.5", "10.0"), conc2 = c(23, 45, 56, 12)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = c("A1", "A2", "A3", "A4"), depth2 = c("deep", "shallow", "shallow", "deep"), conc1 = c(5, 3.4, 5.2, 1.2), conc2 = c(56, 76, 45, 23)), class = "data.frame", row.names = c(NA, -4L))

num_vars_p <- df1 %>%
  select_if(function(x) is.numeric(x) | any(grepl("<[0-9]", x))) %>%
  names()
num_vars_p
# [1] "conc1" "conc2"

pm <- full_join(mutate(df1, across(everything(), as.character)),
                mutate(df2, across(everything(), as.character)),
                by = c("ID", "depth" = "depth2", "conc1", "conc2")) %>% 
  mutate(across(all_of(num_vars_p), as.numeric))
# Warning: There was 1 warning in `mutate()`.
# ℹ In argument: `across(all_of(num_vars_p), as.numeric)`.
# Caused by warning:
# ! NAs introduced by coercion

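With this, since the data are small, we can simply print pm and see that, in this example, conc1 is NA in row 2. I assume your real data are larger, so we need an alternative.
Assuming you don't expect NA values *anywhere*, you can simply use complete.cases: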

pm[!complete.cases(pm),]
#   ID depth conc1 conc2
# 2 A2  deep    NA    45


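This conveniently preserves the row number (2, since we are not using dplyr here) and shows the column in error.
If you have too many columns and need to narrow the output down to the offending ones, we can do this: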

pm %>%
  filter(!complete.cases(pm)) %>%
  select(ID, where(anyNA))
#   ID conc1
# 1 A2    NA


(Though we lose the row number to dplyr's aggressive de-row-numbering.) I'm assuming ID is important to you; if other columns are important too, they are easy to add.
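If the original row numbers matter, a base R version of the same subset keeps them. The following is only a minimal sketch: it types in a stand-in for the joined and coerced pm by hand so that it runs on its own, and it assumes ID is the column you want to keep alongside the columns that contain NAs.

```r
# Stand-in for the joined, coerced `pm` (values typed in by hand for this sketch)
pm <- data.frame(
  ID    = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A4"),
  depth = c("deep", "deep", "shallow", "shallow",
            "deep", "shallow", "shallow", "deep"),
  conc1 = c(3, NA, 2.5, 10, 5, 3.4, 5.2, 1.2),
  conc2 = c(23, 45, 56, 12, 56, 76, 45, 23)
)

# Names of the columns that contain at least one NA
na_cols <- names(pm)[vapply(pm, anyNA, logical(1))]

# Base R subsetting keeps the original row names; union() avoids
# listing ID twice if ID itself happens to contain an NA
bad <- pm[!complete.cases(pm), union("ID", na_cols), drop = FALSE]
bad
#   ID conc1
# 2 A2    NA
```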
If you only need to know which rows or columns are affected:

### which rows
which(!complete.cases(pm))
# [1] 2

### which columns
sapply(pm, anyNA)
#    ID depth conc1 conc2 
# FALSE FALSE  TRUE FALSE 

### which cells
which(is.na(pm), arr.ind = TRUE)
#      row col
# [1,]   2   3

### total
sum(is.na(pm))
# [1] 1

### per-row/column counts
colSums(is.na(pm))
#    ID depth conc1 conc2 
#     0     0     1     0 
rowSums(is.na(pm))
# [1] 0 1 0 0 0 0 0 0
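Note that these counts include any NA that was already present before the conversion. If you need only the NAs that as.numeric introduced, one option is to compare each column before and after coercion. The sketch below assumes you still have the character columns from before the coercion; count_introduced_na and raw are made-up names for illustration, not part of base R or dplyr.

```r
# Hypothetical helper: count the NAs that as.numeric() would introduce.
# `df` holds the columns *before* coercion; `cols` names the columns to check.
count_introduced_na <- function(df, cols) {
  vapply(df[cols], function(x) {
    x <- as.character(x)
    converted <- suppressWarnings(as.numeric(x))
    # NA after coercion but not before: introduced by the coercion
    sum(is.na(converted) & !is.na(x))
  }, integer(1))
}

# Small illustrative input: one detection-limit value and one pre-existing NA
raw <- data.frame(
  conc1 = c("3.0", "<1.0", "2.5", NA),
  conc2 = c("23", "45", "56", "12"),
  stringsAsFactors = FALSE
)

res <- count_introduced_na(raw, c("conc1", "conc2"))
res
# conc1 conc2 
#     1     0 

# The raw values that failed to parse, for inspection
offending <- raw$conc1[is.na(suppressWarnings(as.numeric(raw$conc1))) &
                         !is.na(raw$conc1)]
offending
# [1] "<1.0"
```

The pre-existing NA in conc1 is not counted, which is the difference from colSums(is.na(pm)).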
