Usage of the keywords "within" and "overlaps" in join_by

gcuhipw9  posted on 2023-06-19  in Other

The R/dplyr help file contains the code below, which uses within and overlaps. How should these two keywords be understood? Thanks!

library(dplyr)

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

Example 1: within

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

Example 2: overlaps

by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
8ljdwjyq  #1

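within only captures rows where the range of x lies completely inside the range of y.
overlaps captures rows where there is any kind of overlap between the ranges of x and y. That includes the case where x lies completely inside y, so every within match is also an overlaps match.
This may make it easier to understand (note that the default bounds for overlaps, "[]", are used here):

Example:

x_lower = c(1, 10, 5, 10)
x_upper = c(4, 25, 6, 15)

y_lower = c(0, 15, 10, 3)
y_upper = c(10, 16, 20, 30)

df <- data.frame(x_lower, x_upper, y_lower, y_upper)
# is_within:  x starts at or after y starts AND x ends at or before y ends
# is_overlap: x starts at or before y ends AND x ends at or after y starts
transform(df, 
          is_within = x_lower >= y_lower & x_upper <= y_upper,
          is_overlap = x_lower <= y_upper & x_upper >= y_lower)

#   x_lower x_upper y_lower y_upper is_within is_overlap
# 1       1       4       0      10      TRUE       TRUE
# 2      10      25      15      16     FALSE       TRUE
# 3       5       6      10      20     FALSE      FALSE
# 4      10      15       3      30      TRUE       TRUE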

From the documentation:
within(x_lower, x_upper, y_lower, y_upper)
For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
and
overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
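
The two documented inequalities can be sketched in base R and tried on a few segment/reference pairs from the question. The helper names is_within and is_overlapping are made up for this illustration and are not dplyr functions; only the default "[]" bounds are modeled.

```r
# Hypothetical helpers (not part of dplyr) mirroring the documented inequalities
is_within <- function(x_lower, x_upper, y_lower, y_upper) {
  x_lower >= y_lower & x_upper <= y_upper
}

is_overlapping <- function(x_lower, x_upper, y_lower, y_upper) {
  # default "[]" bounds: touching end points count as an overlap
  x_lower <= y_upper & x_upper >= y_lower
}

# Three pairs taken from the question's data:
#   segment 1 vs reference 1, segment 4 vs reference 2, segment 2 vs reference 3
x_lower <- c(140, 230, 210)
x_upper <- c(150, 280, 240)
y_lower <- c(100, 200, 300)
y_upper <- c(150, 250, 399)

within_flags  <- is_within(x_lower, x_upper, y_lower, y_upper)
overlap_flags <- is_overlapping(x_lower, x_upper, y_lower, y_upper)

within_flags   # TRUE FALSE FALSE
overlap_flags  # TRUE  TRUE FALSE
```

The first pair is contained and therefore also overlaps, the second only overlaps partially, and the third does not intersect at all.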

vsmadaxz  #2

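The documentation for join_by actually covers both of these helper functions.
For within (emphasis mine):
For each range in [x_lower, x_upper], this finds everywhere that range falls completely within [y_lower, y_upper]. Equivalent to x_lower >= y_lower, x_upper <= y_upper.
The inequalities used to build within() are the same regardless of the inclusiveness of the supplied ranges.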

library(dplyr)

full_join(segments, reference, by = "chromosome")
# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # both x smaller than y
3          2 chr2           210   240            3     300   399 # both x smaller than y
4          2 chr2           210   240            4     415   450 # both x smaller than y
5          3 chr2           380   415            3     300   399 # x$end (415) outside range
6          3 chr2           380   415            4     415   450 # x$start (380) outside range
7          4 chr1           230   280            1     100   150 # both x greater than y
8          4 chr1           230   280            2     200   250 # x$end (280) outside range

So join_by(within()) gives:

by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)

# A tibble: 1 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
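
Since the documentation describes within() as equivalent to two inequalities, the same join can be written with those inequalities spelled out. This is a sketch that assumes dplyr 1.1.0 or later (where join_by() is available); the object names with_helper and with_inequalities are only for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Using the within() helper
with_helper <- inner_join(
  segments, reference,
  join_by(chromosome, within(x$start, x$end, y$start, y$end))
)

# Writing out the inequalities that within() stands for
with_inequalities <- inner_join(
  segments, reference,
  join_by(chromosome, x$start >= y$start, x$end <= y$end)
)

# Both joins should keep only segment 1 matched to reference 1
all.equal(with_helper, with_inequalities)
```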

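For overlaps (emphasis mine):
For each range in [x_lower, x_upper], this finds everywhere that range overlaps [y_lower, y_upper] in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower by default.
bounds can be one of "[]", "[)", "(]", or "()" to alter the inclusiveness of the lower and upper bounds. "[]" uses <= and >=, but the 3 other options use < and > and generate the exact same inequalities.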

# A tibble: 8 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150 # yes
2          1 chr1           140   150            2     200   250 # x$end (150) smaller than y$start (200)
3          2 chr2           210   240            3     300   399 # x$end (240) smaller than y$start (300)
4          2 chr2           210   240            4     415   450 # x$end (240) smaller than y$start (415)
5          3 chr2           380   415            3     300   399 # yes 
6          3 chr2           380   415            4     415   450 # yes
7          4 chr1           230   280            1     100   150 # x$start (230) > y$end (150)
8          4 chr1           230   280            2     200   250 # yes

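So full_join() with join_by(overlaps()), as in the question, gives the following. Segment 2 overlaps no reference range, so the full join keeps it with NA in the reference columns: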

# A tibble: 5 × 7
  segment_id chromosome start.x end.x reference_id start.y end.y
       <int> <chr>        <dbl> <dbl>        <int>   <dbl> <dbl>
1          1 chr1           140   150            1     100   150
2          2 chr2           210   240           NA      NA    NA
3          3 chr2           380   415            3     300   399
4          3 chr2           380   415            4     415   450
5          4 chr1           230   280            2     200   250
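
The effect of the bounds argument quoted above can be seen on the same data, because segment 3 (380 to 415) and reference 4 (415 to 450) only touch at 415. The sketch below assumes dplyr 1.1.0 or later and uses inner_join() so that only matching pairs are counted; the object names closed and open are only for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0

segments <- tibble(
  segment_id = 1:4,
  chromosome = c("chr1", "chr2", "chr2", "chr1"),
  start = c(140, 210, 380, 230),
  end = c(150, 240, 415, 280)
)

reference <- tibble(
  reference_id = 1:4,
  chromosome = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 200, 300, 415),
  end = c(150, 250, 399, 450)
)

# Default bounds "[]": uses <= and >=, so ranges that merely touch still match
closed <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
)

# Bounds "()": uses < and >, so the pair that only touches at 415 is dropped
open <- inner_join(
  segments, reference,
  join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "()"))
)

nrow(closed)  # 4 matching pairs
nrow(open)    # 3 matching pairs
```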
