Using regex to mask SSN in a script (bash / perl / python)

Posted by vatpfxk5 on 2022-11-15 in Perl

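I'm trying to write a small script (preferably bash, but python or perl would work just as well) to mask the first 5 digits of an SSN (formatted as either 123456789 or 123-45-6789, so it would output XXXXX6789 or XXX-XX-6789 respectively).
I know I should be able to do this with sed, but I'm having trouble coming up with the right regex (and then I still have to do the substitution).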

123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

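So the SSN can appear at the start of a line, somewhere in the middle, or at the end.
The output (for the first two lines, for example) should have the first 5 digits masked (e.g. with X):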

XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.

I've managed to come up with a grep regex that correctly matches only the expressions I want:

grep '\b[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}\b' testfile

I think I should be able to use grouping in sed or awk to get the result I want, but nothing I've tried has worked.
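Since python is also acceptable, the grep pattern above can be cross-checked with a minimal Python sketch. The name SSN_RE and the short sample list are illustrative and not part of the original post; `-\{0,1\}` is written as `-?`, which means the same thing.

```python
import re

# Same shape as the grep pattern: 3 digits, optional dash, 2 digits,
# optional dash, 4 digits, with a word boundary on each side.
SSN_RE = re.compile(r"\b[0-9]{3}-?[0-9]{2}-?[0-9]{4}\b")

samples = [
    "123456789 needs to be matched.",
    "Mask this 123-45-6789 SSN please",
    "But not 1234567890",
]

# Print whether each sample line contains an SSN-shaped number.
for line in samples:
    print(bool(SSN_RE.search(line)), line)
```

The 10-digit line is rejected because no word boundary exists between two adjacent digits, so the pattern cannot end (or start) in the middle of a longer digit run.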

Source: https://stackoverflow.com/questions/72635733/using-regex-to-mask-ssn-in-a-script-bash-perl-python

5 Answers

hfyxw5xn1#

Using sed:

$ sed '/\<[0-9]\{9\}\>\|\<[0-9-]\{11\}\>/{s/[0-9]\{5\}/XXXXX/;s/[0-9]\{3\}-[0-9]\{2\}/XXX-XX/g}' input_file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

ws51t4hk2#

Using GNU awk for the 3rd argument to match(), for gensub(), and for the \< and \> word boundaries:

$ awk '
    match($0,/(.*)(\<[0-9]{3}-?[0-9]{2})(-?[0-9]{4}\>.*)/,a) {
        $0 = a[1] gensub(/[0-9]/,"X","g",a[2]) a[3]
    }
1' file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
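The same "capture the part to mask, then turn its digits into X" idea can be sketched in Python with a replacement callback. This is an illustration, not the answerer's code: SSN_RE and mask_line are hypothetical names. Note that the awk command above starts with a greedy (.*), so as written it appears to mask only one SSN per line, whereas this sketch handles every match on the line.

```python
import re

# Group 1: the part to mask (first five digits plus an optional dash).
# Group 2: the part to keep (optional dash plus the last four digits).
SSN_RE = re.compile(r"\b([0-9]{3}-?[0-9]{2})(-?[0-9]{4})\b")

def mask_line(line: str) -> str:
    """Replace each digit in group 1 with X, for every SSN on the line."""
    def repl(m):
        # Dashes in group 1 are left alone; only digits become X.
        return re.sub(r"[0-9]", "X", m.group(1)) + m.group(2)
    return SSN_RE.sub(repl, line)

print(mask_line("123456789 needs to be matched (and again 123-45-6789)"))
print(mask_line("But not 1234567890"))
```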

h4cxqtbf3#

perl -lpe 's/\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b/XXX${1}XX$2$3/g'

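You only need to capture what is going to end up in the output: the (possible) dashes and the last four digits. Also, Perl's regex syntax dispenses with unnecessary backslashes, which is nice.
(Specifically, in Perl regexes, "magic" functions are always attached to punctuation without a backslash, or to alphanumerics with a backslash; backslashing punctuation will always make it non-special.)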
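For comparison, a rough Python equivalent of the one-liner above might look like this. The names SSN_RE and mask_ssn are hypothetical; `\g<1>` is Python's explicit group reference, comparable to `${1}` in the Perl replacement.

```python
import re

# Only the optional dashes and the last four digits are captured,
# because they are the only pieces reused in the replacement.
SSN_RE = re.compile(r"\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b")

def mask_ssn(text: str) -> str:
    """Mask the first five digits of every SSN in text, keeping any dashes."""
    return SSN_RE.sub(r"XXX\g<1>XX\g<2>\g<3>", text)

print(mask_ssn("Mask this 123-45-6789 SSN please"))
print(mask_ssn("Don't miss 123456789 either."))
```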

x9ybnkn64#

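Assumption: the first 8 lines have masking applied (the last 3 lines remain unchanged).
Modifying the input file to include two matching SSN patterns in each of the first 2 lines: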

$ cat testfile
123456789 needs to be matched (and again 123-45-6789)
123-45-6789 does, too (and again 123456789)
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

One sed idea using a modified version of OP's regex:

sed -r 's/\b([0-9]{3})(-{0,1})([0-9]{2})(-{0,1}[0-9]{4})\b/XXX\2XX\4/g' testfile

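Where:

-r - enables extended regex support (no need to escape parentheses and braces)
([0-9]{3}) - matches 3 digits (1st capture group)
(-{0,1}) - matches an optional - (2nd capture group)
([0-9]{2}) - matches 2 digits (3rd capture group)
(-{0,1}[0-9]{4}) - matches an optional - plus 4 digits (4th capture group)
XXX\2XX\4 - replaces the 1st capture group with XXX, prints the 2nd capture group as is, replaces the 3rd capture group with XX, prints the 4th capture group as is
g - applies the substitution to all matches in a line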

This generates:

XXXXX6789 needs to be matched (and again XXX-XX-6789)
XXX-XX-6789 does, too (and again XXXXX6789)
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

h5qlskok5#

Grep inverted-match regex (fixed):

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' input-file.txt

Grep options:

-v: invert the match (print every line that does not match).
-E: use extended regular expression syntax for the pattern.

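Regex details:

([^0-9]|^): matches a non-digit or the start of the line.
[0-9]{3}-?: matches 3 digits, optionally followed by a dash.
[0-9]{2}-?: matches 2 digits, optionally followed by a dash.
[0-9]{4}: matches 4 digits.
([^0-9]|$): matches a non-digit or the end of the line.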

Test:

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' <<'EOF'
123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
EOF

Test output:

But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
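In Python, the "non-digit or line edge" checks can be written as lookarounds, which test the neighbouring character without consuming it. The following is a minimal sketch of the same filter, assuming hypothetical names SSN_RE and lines_without_ssn. Like the grep pattern above, and unlike a \b-based one, it also finds an SSN that sits directly against letters or underscores.

```python
import re

# (?<![0-9]) and (?![0-9]) assert "no digit here" without consuming a character.
SSN_RE = re.compile(r"(?<![0-9])[0-9]{3}-?[0-9]{2}-?[0-9]{4}(?![0-9])")

def lines_without_ssn(lines):
    """Return the lines that contain no SSN-shaped number (like grep -v)."""
    return [line for line in lines if not SSN_RE.search(line)]

sample = [
    "123456789 needs to be matched.",
    "And 123-45-6789",
    "But not 1234567890",
    "And 1234567890 is right out.",
]
print(lines_without_ssn(sample))
```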
