Java: Why does my lexer behave as if there were no line endings? [closed]

Posted by llmtgqce on 2023-04-10 in Java

I wrote a lexer in Java.

Token.java looks like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    TK_MINUS ("-"), 
    TK_PLUS ("\\+"), 
    TK_MUL ("\\*"), 
    TK_DIV ("/"), 
    TK_NOT ("~"), 
    TK_AND ("&"),  
    TK_OR ("\\|"),  
    TK_LESS ("<"),
    TK_LEG ("<="),
    TK_GT (">"),
    TK_GEQ (">="), 
    TK_EQ ("=="),
    TK_ASSIGN ("="),
    TK_OPEN ("\\("),
    TK_CLOSE ("\\)"), 
    TK_SEMI (";"), 
    TK_COMMA (","), 
    TK_KEY_DEFINE ("define"), 
    TK_KEY_AS ("as"),
    TK_KEY_IS ("is"),
    TK_KEY_IF ("if"), 
    TK_KEY_THEN ("then"), 
    TK_KEY_ELSE ("else"), 
    TK_KEY_ENDIF ("endif"),
    OPEN_BRACKET ("\\{"),
    CLOSE_BRACKET ("\\}"),

    STRING ("\"[^\"]+\""), 
    TK_FLOAT ("[+-]?([0-9]*[.])?[0-9]+"),
    TK_DECIMAL("(?:0|[1-9](?:_*[0-9])*)[lL]?"),
    TK_OCTAL("0[0-7](?:_*[0-7])*[lL]?"),
    TK_HEXADECIMAL("0x[a-fA-F0-9](?:_*[a-fA-F0-9])*[lL]?"),
    TK_BINARY("0[bB][01](?:_*[01])*[lL]?"),
    IDENTIFIER ("\\w+");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);

        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}

The Lexer class looks like this (Lexer.java):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {
    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exausthed = false;
    private String errorMessage = "";
    private Set<Character> blankChars = new HashSet<Character>();

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exausthed = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }

        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);

        moveAhead();
    }

    public void moveAhead() {
        if (exausthed) {
            return;
        }

        if (input.length() == 0) {
            exausthed = true;
            return;
        }

        ignoreWhiteSpaces();

        if (findNextToken()) {
            return;
        }

        exausthed = true;
        

        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;

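        // Bounds check avoids a StringIndexOutOfBoundsException when only whitespace is left.
        while (charsToDelete < input.length() && blankChars.contains(input.charAt(charsToDelete))) {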
            charsToDelete++;
        }

        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());

            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }

        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExausthed() {
        return exausthed;
    }
}

I created a class, Try.java, to test this lexer:

package draft;

public class Try {

    public static void main(String[] args) {
        Lexer lexer = new Lexer("C:/Users/eimom/Documents/Input.txt");

        System.out.println("Lexical Analysis");
        System.out.println("-----------------");
        while (!lexer.isExausthed()) {
            System.out.printf("%-18s :  %s \n",lexer.currentLexema() , lexer.currentToken());
            lexer.moveAhead();
        }

        if (lexer.isSuccessful()) {
            System.out.println("Ok! :D");
        } else {
            System.out.println(lexer.errorMessage());
        }
    }
}

So, suppose the file Input.txt contains:

>= 
 0x10
 ()
11001100
 -433
 0125
 0x3B

Then the output I expect is:

>=  TK_GEQ
 0x10  TK_HEXADECIMAL
 ( TK_OPEN ,
  )  TK_CLOSE 
11001100 TK_BINARY
 -433 TK_DECIMAL
 0125 TK_OCTAL
 0x3B TK_BINARY

But instead I get:

Lexical Analysis
------------------

>                   :TK_GT
=                   :TK_ASSIGN
0                   :TK_FLOAT 
x10                 :IDENTIFIER
(                   :TK_OPEN
)                   :TK_CLOSE
11001100            :TK_FLOAT
-                   :TK_MINUS
43301250            :TK_FLOAT
x3B                 :IDENTIFIER

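What can I do to fix these problems? It looks like the code does not stop at the end of a line, but carries on with the first character of the next line.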

Answer #1 (woobm2wo)

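You caused this yourself by using Files.lines(Path). The stream returned by Files.lines contains the content of each line without its line terminator, so when you append all the lines back into input, you end up with the file's content without any line breaks.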
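
The effect can be reproduced in isolation. The following is a minimal sketch, not the asker's code: the class name LineEndingDemo and the helper readBothWays are made up for illustration, and it assumes Java 11 or later for Files.readString and Files.writeString. It writes a small temporary file, joins its lines the way the question's constructor does, and then reads the same file in one piece for comparison.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of the question's code.
class LineEndingDemo {

    // Writes content to a temp file and returns { joined lines, whole file }.
    static String[] readBothWays(String content) {
        try {
            Path tmp = Files.createTempFile("lexer-input", ".txt");
            try {
                Files.writeString(tmp, content);

                // Same effect as st.forEach(input::append) in the question:
                // the lines are concatenated with nothing in between.
                String joined;
                try (Stream<String> lines = Files.lines(tmp)) {
                    joined = lines.collect(Collectors.joining());
                }

                // Reads the file as a single string, line terminators included.
                String whole = Files.readString(tmp);

                return new String[] { joined, whole };
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = readBothWays("-433\n0125\n0x3B\n");
        System.out.println("Files.lines + append: [" + result[0] + "]");
        System.out.println("Files.readString:     [" + result[1] + "]");
    }
}
```

With the three lines used here, the joined version is the single run -43301250x3B, which is consistent with the merged 43301250 lexeme in the question's output.
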
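Maybe you want to use Files.readString(Path) instead. I also wonder why you don't use a Reader to read character by character. That is usually more efficient than trying to read the whole file into memory (although it only starts to matter when you want to analyze very large files).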
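
To illustrate the Reader suggestion, here is a rough sketch under stated assumptions: the class ReaderChunker and its method chunks are hypothetical names, and the code only splits the input into whitespace-separated chunks rather than classifying tokens. The point is that line breaks arrive as ordinary characters from Reader.read(), so they can end a chunk just like a space does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the question's Lexer.
class ReaderChunker {

    // Reads one character at a time and splits the input on whitespace.
    static List<String> chunks(Reader reader) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    // A space, tab or line break ends the chunk in progress.
                    if (current.length() > 0) {
                        result.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Flush the last chunk if the input does not end with whitespace.
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the file here.
        System.out.println(chunks(new StringReader(">=\n0x10\n()\n-433\n0125\n")));
    }
}
```

For a real file, a reader from Files.newBufferedReader(Path) could be passed instead of the StringReader; each chunk could then be handed to the existing Token patterns.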

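Restoring the line breaks may not account for everything in the question's output. findNextToken() returns the first pattern that matches, and in the enum TK_GT is declared before TK_GEQ and TK_FLOAT before TK_HEXADECIMAL, so >= and 0x10 would probably still be split even within a single line. The sketch below is an illustration under assumptions, not the asker's code: LongestMatchDemo and its cut-down token table are hypothetical, and the longest-match rule is one common alternative to reordering the enum constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class with a cut-down token table.
class LongestMatchDemo {

    // Same relative order and regexes as the corresponding constants in the question.
    static final Map<String, Pattern> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("TK_GT", Pattern.compile(">"));
        TOKENS.put("TK_GEQ", Pattern.compile(">="));
        TOKENS.put("TK_ASSIGN", Pattern.compile("="));
        TOKENS.put("TK_FLOAT", Pattern.compile("[+-]?([0-9]*[.])?[0-9]+"));
        TOKENS.put("TK_HEXADECIMAL", Pattern.compile("0x[a-fA-F0-9](?:_*[a-fA-F0-9])*[lL]?"));
    }

    // First-match rule, as in the question: the first pattern that matches wins.
    static String firstMatch(String input) {
        for (Map.Entry<String, Pattern> e : TOKENS.entrySet()) {
            // lookingAt() only matches at the start of the input.
            if (e.getValue().matcher(input).lookingAt()) {
                return e.getKey();
            }
        }
        return null;
    }

    // Longest-match rule: try every pattern and keep the one that consumes
    // the most characters; on a tie the earlier pattern is kept.
    static String longestMatch(String input) {
        String best = null;
        int bestEnd = 0;
        for (Map.Entry<String, Pattern> e : TOKENS.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestEnd) {
                best = e.getKey();
                bestEnd = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String s : new String[] { ">=", "0x10" }) {
            System.out.println(s + "  first: " + firstMatch(s) + "  longest: " + longestMatch(s));
        }
    }
}
```

Simply declaring the longer or more specific patterns earlier in the enum would have a similar effect for these two inputs; the longest-match variant just avoids depending on declaration order.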