gawk or grep: single-line and ungreedy

iecba09b  posted on 2021-07-06 in Java

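I want to print the class declarations, from *.java files in all subdirectories, that have more than two type parameters (i.e. the <R ... H> part in the example below). One of the files looks like this (names shortened for brevity):
multiple-lines.java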

class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) { 
    // ... code ...
  }
}

with the expected output:

ClazzA.java:10: class ClazzA<R extends A,
ClazzA.java:11:     S extends B<T>, T extends C<T>,
ClazzA.java:12:     U extends D, W extends E,
ClazzA.java:13:     X extends F, Y extends G, Z extends H>
ClazzA.java:14:     extends OtherClazz<S> implements I<T> {

But another file could also look like this:
single-line.java

class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) { 
    // ... code ...
  }
}

with the expected output:

ClazzB.java:42: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

Files that should not be considered/printed:
X-no-parameter.java

class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) { 
    // ... code ...
  }
}

X-one-parameter.java

class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

X-two-parameters.java

class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

X-two-line-parameters.java

class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

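Any whitespace in the files can be \s+. The extends [...] and implements [...] directly before the { are optional. The extends [...] at each type parameter is optional, too. See the Java® Language Specification, 8.1. Class Declarations for details.
I am using gawk in Git Bash: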

$ gawk --version
GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

with:

find . -type f -name '*.java' | xargs gawk -f ws-class-type-parameter.awk > ws-class-type-parameter.log

and ws-class-type-parameter.awk:


# /start/ , /end/ ... pattern

# /class ClazzA<.*,.*/      , /{/  {    # 5 lines, OK for ClazzA, but in real it prints classes with 2 or less type parameters, too

# /class ClazzA<.*,.*,/     , /{/  {    # no line with ClazzA, since there's no second ',' on its first line

# /class ClazzA<.*,.*,/s    , /{/  {    # 500.000+(!) lines

# /class ClazzA<.*,.*,/s    , /{/U {    # 500.000+(!) lines

# /class ClazzA<.*,.*,/sU   , /{/U {    # 500.000+(!) lines

 /(?s)class ClazzA<.*,.*,/ , /{/  {    # no line

    match( FILENAME, "/.*/.." )
    print substr( FILENAME, RLENGTH ) ":" FNR ": " $0
}

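This finds all *.java files... fine, and runs gawk on each of them... fine, but the results are what you see in the comments next to my attempts. Please note: the ClazzA literal is only there for testing and for the MCVE. In reality it would be \w+, but with 500,000+ lines in thousands of files while testing...
It works if I try it at regex101.com. Well, sort of. I did not know how to define /start-regex/,/end-regex/ there, so I added another .* in between.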
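I took the flags from there as well, but I could not find a description of whether gawk supports the flag syntax /.../sU , /.../U so I just gave it a try. A now-deleted comment told me that awk does not support this.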
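A likely explanation for the 500,000+ lines: awk has no regex flags, so a letter after the closing slash is read as a variable name. `/re/s` is then the match result (0 or 1) concatenated with the uninitialized variable `s`, which yields the non-empty string "0" or "1", and a non-empty string is always true. The following is a minimal sketch with made-up input (the function name `demo_input` and the pattern `zzz` are illustrative only):

```shell
# Hypothetical two-line input; neither line contains "zzz".
demo_input() { printf 'first line\nsecond line\n'; }

# Plain regex pattern: no line matches, so nothing is printed.
plain=$(demo_input | awk '/zzz/')

# "/zzz/s" is the match result (0) concatenated with the empty variable s,
# i.e. the non-empty string "0", which is true for every record.
flagged=$(demo_input | awk '/zzz/s')

printf 'plain:   [%s]\n' "$plain"
printf 'flagged: [%s]\n' "$flagged"
```

The same reasoning applies to `/{/U` as the end of a range: it is true on every line, so each range closes on the line where it opened and a new one starts on the next line.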
I also tried grep:

$ grep --version
grep (GNU grep) 3.1
...
$ grep -nrPf types.grep *.java

with types.grep:

(?s).*class\s+\w+\s*<.*,.*,.*>.*{

This produces output for single-line.java only. (?s) is --perl-regexp, -P syntax, and grep --help claims to support it.

Update

The solution in Ed Morton's answer works well, but it turned out that there are auto-generated files with methods like this:

/**more code before here */    
    public void setId(String value) {
        this.id = value;
    }

    /**
     * Gets a map that contains attributes that aren't bound to any typed property on this class.
     * 
     * <p>
     * the map is keyed by the name of the attribute and 
     * the value is the string value of the attribute.
     * 
     * the map returned by this method is live, and you can add new attribute
     * by updating the map directly. Because of this design, there's no setter.
     * 
     * 
     * @return
     *     always non-null
     */
    public Map<QName, String> getOtherAttributes() {
        return otherAttributes;
    }

which produce output such as:

AbstractAddressType.java:81:      * Gets a map that contains attributes that aren't bound to any typed property on this class.
AbstractAddressType.java:82:      * 
AbstractAddressType.java:83:      * <p>
AbstractAddressType.java:84:      * the map is keyed by the name of the attribute and 
AbstractAddressType.java:85:      * the value is the string value of the attribute.
AbstractAddressType.java:86:      * 
AbstractAddressType.java:87:      * the map returned by this method is live, and you can add new attribute
AbstractAddressType.java:88:      * by updating the map directly. Because of this design, there's no setter.
AbstractAddressType.java:89:      * 
AbstractAddressType.java:90:      * 
AbstractAddressType.java:91:      * @return
AbstractAddressType.java:92:      *     always non-null
AbstractAddressType.java:93:      */
AbstractAddressType.java:94:     public Map<QName, String> getOtherAttributes() {

And there are some with class comments and annotations, like:

/**
 * This class was generated by Apache CXF 3.3.4
 * 2020-11-30T12:03:21.251+01:00
 * Generated source version: 3.3.4
 *
 */
@WebService(targetNamespace = "urn:SZRServices", name = "SZR")
@XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})
public interface SZR {
// more code after here

which produce output such as:

SZR.java:13:  * This class was generated by Apache CXF 3.3.4
SZR.java:14:  * 2020-10-12T11:51:35.175+02:00
SZR.java:15:  * Generated source version: 3.3.4
SZR.java:16:  *
SZR.java:17:  */
SZR.java:18: @WebService(targetNamespace = "urn:SZRServices", name = "SZR")
SZR.java:19: @XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})

368yc8dk1#

Using any POSIX awk in any shell on every Unix box:

$ cat tst.awk
/[[:space:]]*class[[:space:]]*/ {
    inDef = 1
    fname = FILENAME
    sub(".*/","",fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS

    # Remove comments (not perfect but should work for 99.9% of cases)
    sub("//.*","")
    gsub("/[*]|[*]/","\n")
    gsub(/\n[^\n]*\n/,"")

    def = def $0 ORS
    if ( /{/ ) {
        if ( gsub(/,/,"&",def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}
$ find tmp -type f -name '*.java' -exec awk -f tst.awk {} +
multiple-lines.java:1: class ClazzA<R extends A,
multiple-lines.java:2:     S extends B<T>, T extends C<T>,
multiple-lines.java:3:     U extends D, W extends E,
multiple-lines.java:4:     X extends F, Y extends G, Z extends H>
multiple-lines.java:5:     extends OtherClazz<S> implements I<T> {
single-line.java:1: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

The above was run using this input:

$ head tmp/*
==> tmp/X-no-parameter.java <==
class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) {
    // ... code ...
  }
}

==> tmp/X-one-parameter.java <==
class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-line-parameters.java <==
class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-parameters.java <==
class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/multiple-lines.java <==
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

==> tmp/single-line.java <==
class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

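The above is just a best effort, without writing a parser for the language and with only the OP's posted sample input/output to go on for what needs to be handled.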
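One possible refinement, which is not part of the script above: that script counts every comma in the declaration, including commas inside nested bounds or in an `implements I<T, U>` clause. A sketch of a stricter count could look only at the `<...>` group that directly follows `class Name` and count commas at nesting depth 1. The names `count_params` and `count_type_params` are hypothetical, and the input is assumed to be one already-joined declaration per line:

```shell
# count_params: hypothetical helper. Reads one declaration per line and
# prints the number of type parameters found for each.
count_params() {
  awk '
    function count_type_params(decl,    i, c, depth, n) {
        # The type parameter list must directly follow "class Name".
        if (!match(decl, /class[ \t\n]+[A-Za-z_$][A-Za-z0-9_$]*[ \t\n]*</))
            return 0
        # Start scanning at the "<" that ended the match.
        for (i = RSTART + RLENGTH - 1; i <= length(decl); i++) {
            c = substr(decl, i, 1)
            if (c == "<") { if (++depth == 1) n = 1 }
            else if (c == ">") { if (--depth == 0) return n }
            else if (c == "," && depth == 1) n++   # top-level commas only
        }
        return n + 0   # list not closed: report what was seen so far
    }
    { print count_type_params($0) }
  '
}

# Sample declarations taken from the question; prints 8, 2 and 0.
printf '%s\n' \
  'class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {' \
  'class ClazzE<R extends A, S extends B<T>>  // only two type parameters' \
  'class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {' |
count_params
```

A declaration would then qualify when the count is greater than 2. Requiring a name after `class` also avoids starting on comment lines that merely end in the word "class", but this is still pattern matching, not a Java parser.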


vojdkbi02#

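Note: the presence of annotations may cause these solutions to fail.
With ripgrep (https://github.com/burntsushi/ripgrep):

rg -nU --no-heading '(?s)class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java

- `-n` enables line numbers (this is the default when output goes to a terminal)
- `-U` enables multiline matching
- `--no-heading`: by default, `ripgrep` shows matching lines grouped under the filename as a heading; this option makes `ripgrep` behave like `GNU grep`, with a filename prefix on every output line
- `[^{]*` is used instead of `.*` to prevent matching `,` and `>` elsewhere in the file; otherwise lines like `public void method(Type<Q, R> x) {` would get matched
- the `-m` option can be used to limit the number of matches per input file, which has the added benefit of not having to search the whole input file

If you use the above regexp with `GNU grep`, note the following:

- `grep` matches only one line at a time. With the `-z` option, `grep` treats ASCII NUL as the record separator, which effectively lets you match across multiple lines, provided the input has no NUL characters that would prevent such a match
- another effect of the `-z` option is that a NUL character is appended to each output result (this can be fixed by piping the results to `tr '\0' '\n'`)
- the `-o` option is needed to print only the matching portion, which means you will not get the line number prefix

For the given task, `-P` is not needed. `grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java | tr '\0' '\n'` gives a result similar to the `ripgrep` command. However, you will not get the line number prefix, the filename prefix appears only once per matching portion instead of on every matching line, and you will not get the rest of the line before `class` and after `{`.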
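To make the `-z` behaviour concrete, here is a small sketch that assumes `GNU grep` is the `grep` on the PATH. The scratch directory and the shortened three-parameter declaration are made up for illustration:

```shell
# Hypothetical scratch file with a declaration spread over three lines.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'class ClazzA<R extends A,' \
  '    S extends B<T>, T extends C<T>>' \
  '    extends OtherClazz<S> implements I<T> {' \
  '  public void method(Type<Q, R> x) {' \
  '  }' \
  '}' > "$tmpdir/ClazzA.java"

# -z: the whole file is one record, so [^{]* can run across newlines.
# -o: print only the matched portion; tr turns the trailing NUL into a newline.
matched=$(grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' "$tmpdir/ClazzA.java" |
  tr '\0' '\n')

printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

With a single file argument there is no filename prefix at all, and the match stops at the first `{`, so the method line is not part of the output.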
