Masking SSNs with a regular expression in a script (bash / perl / python)

vatpfxk5, posted on 2022-11-15 in Perl

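I am trying to write a small script (preferably in bash, but python or perl would also be fine) to mask the first 5 digits of an SSN (formatted as either 123456789 or 123-45-6789, so it would output XXXXX6789 or XXX-XX-6789, respectively).
I know I should be able to do this with sed, but I am having trouble creating the right regular expression (and then I still have to do the substitution).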

123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

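So the SSN can appear at the beginning of a line, somewhere in the middle, or at the end.
The output (for the first two lines, for example) should mask the first 5 digits, e.g. with X: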

XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.

I have managed to come up with a grep regular expression that correctly matches only the expressions I want:

grep '\b[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}\b' testfile

I think I should be able to use grouping in sed or awk to get the result I want, but nothing I have tried has worked.


Answer 1 (hfyxw5xn)

Using sed

$ sed '/\<[0-9]\{9\}\>\|\<[0-9-]\{11\}\>/{s/[0-9]\{5\}/XXXXX/;s/[0-9]\{3\}-[0-9]\{2\}/XXX-XX/g}' input_file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

Answer 2 (ws51t4hk)

Using GNU awk for the third argument to match(), for gensub(), and for the \< and \> word boundaries:

$ awk '
    match($0,/(.*)(\<[0-9]{3}-?[0-9]{2})(-?[0-9]{4}\>.*)/,a) {
        $0 = a[1] gensub(/[0-9]/,"X","g",a[2]) a[3]
    }
1' file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
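
For readers who prefer Python, the same idea (split the match into a prefix to mask and a tail to keep, then turn every digit of the prefix into X) can be sketched with a replacement function. This is only an illustration, not part of the answer: the names PATTERN, mask_prefix and mask_line are hypothetical, and \b is used here as a stand-in for awk's \< and \>.

```python
import re

# Group 1: the first five digits, with an optional dash after the third.
# Group 2: the optional dash plus the last four digits, kept as-is.
# Assumption: \b plays the role of awk's \< and \> word boundaries.
PATTERN = re.compile(r"\b([0-9]{3}-?[0-9]{2})(-?[0-9]{4})\b")

def mask_prefix(match):
    """Replace each digit of group 1 with X, like gensub(/[0-9]/,"X","g",a[2])."""
    return re.sub(r"[0-9]", "X", match.group(1)) + match.group(2)

def mask_line(line):
    """Mask every SSN-shaped token on the line."""
    return PATTERN.sub(mask_prefix, line)

if __name__ == "__main__":
    print(mask_line("Mask this 123-45-6789 SSN please"))
    print(mask_line("But not 1234567890"))
```

Because re.sub calls the function once per match, a line containing more than one SSN is masked in full.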

Answer 3 (h4cxqtbf)

perl -lpe 's/\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b/XXX${1}XX$2$3/g'

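You only need to capture what will end up in the output: the (possible) dashes and the last four digits. Also, Perl's regex syntax does away with the unnecessary backslashes, which is nice.
(Specifically, in Perl regex syntax, "magic" behavior is always attached to punctuation *without* a backslash, or to alphanumerics *with* a backslash; backslashing a punctuation character always makes it non-special.)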
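
Since the question also allows Python, here is a minimal sketch of the same substitution with the re module. It is an approximate translation rather than the answerer's code; SSN_RE and mask_ssn are hypothetical names chosen for the example.

```python
import re

# Same pattern as the Perl one-liner: only the optional dashes and the
# last four digits are captured, because only they reappear in the output.
SSN_RE = re.compile(r"\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b")

def mask_ssn(line):
    # \g<1> and \g<2> re-emit the dashes (or nothing); \g<3> keeps the tail.
    # The \g<n> form avoids ambiguity when a group reference is followed by text.
    return SSN_RE.sub(r"XXX\g<1>XX\g<2>\g<3>", line)

if __name__ == "__main__":
    for sample in ("Don't miss 123456789 either.", "And 123-45-6789"):
        print(mask_ssn(sample))
```

The \g<1> spelling serves the same purpose as ${1} in the Perl replacement: it keeps the group number separate from the literal XX that follows it.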


Answer 4 (x9ybnkn6)

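Assuming the mask applies to the first 8 lines (and the last 3 are left unchanged).
The input file is modified so that the first 2 lines each contain two SSN patterns: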

$ cat testfile
123456789 needs to be matched (and again 123-45-6789)
123-45-6789 does, too (and again 123456789)
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

One sed idea, using a modified version of the OP's regex:

sed -r 's/\b([0-9]{3})(-{0,1})([0-9]{2})(-{0,1}[0-9]{4})\b/XXX\2XX\4/g' testfile

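Where:

  • -r - enables extended regex support (no need to escape parentheses and braces)
  • ([0-9]{3}) - matches 3 digits (1st capture group)
  • (-{0,1}) - matches an optional - (2nd capture group)
  • ([0-9]{2}) - matches 2 digits (3rd capture group)
  • (-{0,1}[0-9]{4}) - matches an optional - plus 4 digits (4th capture group)
  • XXX\2XX\4 - replaces the 1st capture group with XXX, prints the 2nd capture group as-is, replaces the 3rd capture group with XX, prints the 4th capture group as-is
  • g - applies to all matches on a line

This generates: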

XXXXX6789 needs to be matched (and again XXX-XX-6789)
XXX-XX-6789 does, too (and again XXXXX6789)
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
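
To see what each of the four capture groups holds, the same pattern can be inspected in Python. This is a small sketch for illustration only; the variable names are hypothetical, and Python's re module is used in place of sed -r.

```python
import re

# The four-group pattern from the sed command, unchanged.
FOUR_GROUPS = re.compile(r"\b([0-9]{3})(-{0,1})([0-9]{2})(-{0,1}[0-9]{4})\b")

dashed = FOUR_GROUPS.search("Mask this 123-45-6789 SSN please")
plain = FOUR_GROUPS.search("Don't miss 123456789 either.")

# Groups 1 and 3 hold the digits to hide; groups 2 and 4 are kept.
print(dashed.groups())  # ('123', '-', '45', '-6789')
print(plain.groups())   # ('123', '', '45', '6789')

# expand() fills in a replacement template, like XXX\2XX\4 in sed.
masked_dashed = dashed.expand(r"XXX\2XX\4")
masked_plain = plain.expand(r"XXX\2XX\4")
print(masked_dashed)  # XXX-XX-6789
print(masked_plain)   # XXXXX6789
```

Groups 1 and 3 are never referenced in the replacement, which is why their digits disappear from the output.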

Answer 5 (h5qlskok)

Grep inverted-match regex (fixed):

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' input-file.txt

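Grep options:

  • -v: invert the match (print every line that does not match).
  • -E: use extended regular expression syntax for the pattern.

Regex details:

  • ([^0-9]|^): matches a non-digit or the start of the line.
  • [0-9]{3}-?: matches 3 digits, optionally followed by a dash.
  • [0-9]{2}-?: matches 2 digits, optionally followed by a dash.
  • [0-9]{4}: matches 4 digits.
  • ([^0-9]|$): matches a non-digit or the end of the line.

Test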

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' <<'EOF'
123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
EOF

Test output:

But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
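
The same filter can be sketched in Python. As an assumption of this sketch, the ([^0-9]|^) and ([^0-9]|$) guards are written as lookarounds, which check the neighbouring character without consuming it; SSN_LIKE and lines_without_ssn are hypothetical names.

```python
import re

# (?<![0-9]) : not preceded by a digit (also true at the start of a line)
# (?![0-9])  : not followed by a digit (also true at the end of a line)
SSN_LIKE = re.compile(r"(?<![0-9])[0-9]{3}-?[0-9]{2}-?[0-9]{4}(?![0-9])")

def lines_without_ssn(lines):
    """Keep only lines with no SSN-shaped token, like grep -v."""
    return [line for line in lines if not SSN_LIKE.search(line)]

if __name__ == "__main__":
    sample = [
        "123456789 needs to be matched.",
        "Mask this 123-45-6789 SSN please",
        "But not 1234567890",
        "And 1234567890 is right out.",
    ]
    for line in lines_without_ssn(sample):
        print(line)
```

For a pure filter the two spellings select the same lines; the lookaround form only matters if the pattern is later reused for substitution, where a consumed neighbouring character would have to be put back.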
