Filtering a dataset

Posted by pxyaymoc on 2021-07-12 in Spark

I am using a regular expression to filter a dataset of 10 rows that looks like this:

ID     Product
1      "VENLAFAXINE HCL CAP ER 24HR 37.5 MG (BASE EQUIVALENT)"
2      "MINOXIDIL POWDER"
3      "MENTHOL LOZENGE 10 MG"
4      "ZINC CHLORIDE GRANULES"
5      "CLOPIDOGREL BISULFATE TAB 75 MG (BASE EQUIV)"
6      "METHYLPREDNISOLONE TAB THERAPY PACK 4 MG (21)"
7      "DEXAMETHASONE TAB THERAPY PACK 1.5 MG (7)"
8      "METHYLPREDNISOLONE DOSE P (16)"
9      "MILLIPRED DP (13)"
10     "ZONACORT 7 DAY"

I want the result to look like this:

ID     Product
6      "METHYLPREDNISOLONE TAB THERAPY PACK 4 MG (21)"
7      "DEXAMETHASONE TAB THERAPY PACK 1.5 MG (7)"
8      "METHYLPREDNISOLONE DOSE P (16)"
9      "MILLIPRED DP (13)"

In other words, I want to keep only the rows whose Product ends with a number in parentheses. I tried the following, but it did not work:

SELECT ID, Product
FROM DAT
WHERE product like '%[(][0-9][)]';

#1 iecba09b

You can try using RLIKE to match a regular-expression pattern:

SELECT ID, Product
FROM DAT
WHERE product RLIKE '\\([0-9]+\\)$';
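To see which rows such a pattern keeps, here is a minimal Python sketch that applies the same regular expression to the sample data from the question. It uses the standard `re` module rather than Spark; the names `rows`, `TRAILING_NUMBER`, and `ends_with_parenthesized_number` are hypothetical and chosen only for this illustration.

```python
import re

# Sample rows copied from the question: (ID, Product).
rows = [
    (1, "VENLAFAXINE HCL CAP ER 24HR 37.5 MG (BASE EQUIVALENT)"),
    (2, "MINOXIDIL POWDER"),
    (3, "MENTHOL LOZENGE 10 MG"),
    (4, "ZINC CHLORIDE GRANULES"),
    (5, "CLOPIDOGREL BISULFATE TAB 75 MG (BASE EQUIV)"),
    (6, "METHYLPREDNISOLONE TAB THERAPY PACK 4 MG (21)"),
    (7, "DEXAMETHASONE TAB THERAPY PACK 1.5 MG (7)"),
    (8, "METHYLPREDNISOLONE DOSE P (16)"),
    (9, "MILLIPRED DP (13)"),
    (10, "ZONACORT 7 DAY"),
]

# Same idea as the RLIKE pattern: "(", one or more digits, ")", end of string.
TRAILING_NUMBER = re.compile(r"\([0-9]+\)$")


def ends_with_parenthesized_number(product: str) -> bool:
    """Return True if the product name ends with a number in parentheses."""
    return TRAILING_NUMBER.search(product) is not None


kept = [(row_id, product) for row_id, product in rows
        if ends_with_parenthesized_number(product)]

for row_id, product in kept:
    print(row_id, product)
```

In the SQL string literal the backslashes are doubled because the literal itself is unescaped before the regex engine sees it; the Python raw string above needs only a single backslash.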

#2 8tntrjer

In base R, we can use grepl to match an opening parenthesis ( \\( ) followed by one or more digits ( \\d+ ), then a closing parenthesis ( \\) ) at the end ( $ ) of the string:

subset(df1, grepl("\\(\\d+\\)$", Product))

# ID                                       Product
# 6  6 METHYLPREDNISOLONE TAB THERAPY PACK 4 MG (21)
# 7  7     DEXAMETHASONE TAB THERAPY PACK 1.5 MG (7)
# 8  8                METHYLPREDNISOLONE DOSE P (16)
# 9  9                             MILLIPRED DP (13)

Data

df1 <- structure(list(ID = 1:10, Product = c("VENLAFAXINE HCL CAP ER 24HR 37.5 MG (BASE EQUIVALENT)", 
"MINOXIDIL POWDER", "MENTHOL LOZENGE 10 MG", "ZINC CHLORIDE GRANULES", 
"CLOPIDOGREL BISULFATE TAB 75 MG (BASE EQUIV)", "METHYLPREDNISOLONE TAB THERAPY PACK 4 MG (21)", 
"DEXAMETHASONE TAB THERAPY PACK 1.5 MG (7)", "METHYLPREDNISOLONE DOSE P (16)", 
"MILLIPRED DP (13)", "ZONACORT 7 DAY")), class = "data.frame", row.names = c(NA, 
-10L))

#3 nnt7mjpx

Unfortunately, SQL Server does not support regular expressions. But you can do this:

WHERE product like '%([0-9]%)' AND
      product NOT LIKE '%(%[^0-9]%)'

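The first condition checks that the string ends with a closing parenthesis and contains an opening parenthesis immediately followed by a digit.
The second verifies that every character between the parentheses is a digit.
That said, this is not perfect, but it works as long as product contains no other parentheses.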
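The caveat about other parentheses can be illustrated with a small Python sketch. It assumes the two LIKE patterns can be hand-translated into regular expressions as shown in the comments (this translation is an assumption of the sketch, not something the database does). The helper name `passes_like_workaround` and the product string `"ALPHA (XR) TAB (30)"` are hypothetical.

```python
import re

# Hand-translated equivalents of the two LIKE patterns (an assumption of this sketch):
#   '%([0-9]%)'   -> anything, "(", a digit, anything, ")" at the very end
#   '%(%[^0-9]%)' -> anything, "(", anything, a non-digit, anything, ")" at the very end
COND_LIKE = re.compile(r".*\([0-9].*\)", re.DOTALL)
COND_NOT_LIKE = re.compile(r".*\(.*[^0-9].*\)", re.DOTALL)

# The stricter regular expression used in the other answers, for comparison.
TRAILING_NUMBER = re.compile(r"\([0-9]+\)$")


def passes_like_workaround(product: str) -> bool:
    """Emulate: product LIKE '%([0-9]%)' AND product NOT LIKE '%(%[^0-9]%)'."""
    return (COND_LIKE.fullmatch(product) is not None
            and COND_NOT_LIKE.fullmatch(product) is None)


samples = [
    "MILLIPRED DP (13)",                             # from the question
    "CLOPIDOGREL BISULFATE TAB 75 MG (BASE EQUIV)",  # from the question
    "MENTHOL LOZENGE 10 MG",                         # from the question
    "ALPHA (XR) TAB (30)",                           # hypothetical: two pairs of parentheses
]

for product in samples:
    print(product,
          "| LIKE workaround:", passes_like_workaround(product),
          "| regex:", TRAILING_NUMBER.search(product) is not None)
```

For the hypothetical product with two pairs of parentheses, the second pattern sees the non-digit characters after the first opening parenthesis, so the workaround rejects the row even though it ends with a number in parentheses.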
