textract error: unbalanced parenthesis

ha5z0ras posted 6 months ago in Other

I tried running textract on a file with an unsupported format:
textract.process('./test.pyc')
and got the following error:

Exception raised:
    Traceback (most recent call last):
      File "C:\decodertextract.py", line 70, in __decoder_textract
        print textract.process(pathname)
      File "C:\Program Files\Python27\lib\site-packages\textract\parsers\__init__.py", line 72, in process
        raise exceptions.ExtensionNotSupported(ext)
      File "C:\Program Files\Python27\lib\site-packages\textract\exceptions.py", line 21, in __init__
        for e in _get_available_extensions():
      File "C:\Program Files\Python27\lib\site-packages\textract\parsers\__init__.py", line 89, in _get_available_extensions
        ext_re = re.compile(glob_filename.replace('*', "(?P<ext>\w+)"))
      File "C:\Program Files\Python27\lib\re.py", line 194, in compile
        return _compile(pattern, flags)
      File "C:\Program Files\Python27\lib\re.py", line 251, in _compile
        raise error, v # invalid expression
    error: unbalanced parenthesis

Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)] on win32
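version = '1.6.1'
I believe the program is trying to raise the intended exception, but it actually crashes inside the re module before it can do so. Could you take a look?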
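
The traceback above suggests the failure happens while the ExtensionNotSupported exception is still being constructed, so the regex error replaces the intended one. Here is a minimal sketch of that masking effect. It uses a hypothetical stand-in class (DemoNotSupported) and helper names that are not part of textract:

```python
import re


class DemoNotSupported(Exception):
    """Hypothetical stand-in for an exception that builds its message eagerly."""

    def __init__(self, ext, pattern):
        # If compiling the pattern fails, this constructor never finishes,
        # so the caller sees re.error instead of DemoNotSupported.
        re.compile(pattern)
        super().__init__("extension %s is not supported" % ext)


def process(ext, pattern):
    """Always tries to raise DemoNotSupported, like an unsupported-format path."""
    raise DemoNotSupported(ext, pattern)


def caught_exception(ext, pattern):
    """Run process() and return whatever exception actually escaped."""
    try:
        process(ext, pattern)
    except Exception as exc:
        return exc


good = caught_exception(".pyc", r"(?P<ext>\w+)")
bad = caught_exception(".pyc", r"\(?P<ext>\w+)")  # leading backslash escapes "("
print(type(good).__name__, isinstance(bad, re.error))
```

With a valid pattern the intended exception reaches the caller. With the broken pattern the caller only sees the regex error, which matches the symptom reported here.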

flvlnr44 #1

I can't reproduce this:

# TESTING ON COMMAND LINE INTERFACE
[bash]$ touch test.pyc
[bash]$ textract test.pyc
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx
[bash]$ textract ./test.pyc
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx

# TESTING WITH PYTHON SCRIPT
[bash]$ echo "import textract" > blah.py
[bash]$ echo "textract.process('./test.pyc')" >> blah.py
[bash]$ python blah.py
The filename extension .pyc is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Available extensions include: .csv, .doc, .docx, .eml, .epub, .gif, .htm, .html, .jpeg, .jpg, .json, .log, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .ps, .psv, .rtf, .tff, .tif, .tiff, .tsv, .txt, .wav, .xls, .xlsx

Does the actual filename contain parentheses or anything similar?

7eumitmz #2

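Hi Dean,
Thank you very much for looking into this. I tried what you did, i.e. processing an empty .pyc file, but I still get the same error. I suspect the problem only occurs on Windows.
I will try to investigate further later and provide more details.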

axkjgtzd #3

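Hello!
The library reproduces the error on Windows because paths there contain backslashes ("\"). The backslash directly in front of the "*" ends up escaping the opening parenthesis of the regex group, which leaves the closing parenthesis unbalanced.
The following workaround works for me (file: parsers\__init__.py):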

    # from filenames
    parsers_dir = os.path.join(os.path.dirname(__file__))
    glob_filename = os.path.join(parsers_dir, "*" + _FILENAME_SUFFIX + ".py")
    # added: use forward slashes so no backslash lands in the regex pattern
    glob_filename = glob_filename.replace("\\", "/")
    ext_re = re.compile(glob_filename.replace('*', r"(?P<ext>\w+)"))
    for filename in glob.glob(glob_filename):
        # added: normalize each globbed path the same way before matching
        filename = filename.replace("\\", "/")
        ext_match = ext_re.match(filename)
        ext = ext_match.groups()[0]
        extensions.append(ext)
        extensions.append('.' + ext)
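
To illustrate why the backslash matters, here is a small sketch (written for Python 3) that reproduces the regex failure on any platform and shows one possible alternative to the workaround above: matching only the file name with its literal parts escaped. The value of _FILENAME_SUFFIX and the helper names naive_regex, basename_regex and extension_of are assumptions for illustration and are not taken from textract. Only the tail of the path is used so that the demonstration isolates the backslash in front of the "*".

```python
import ntpath  # Windows path rules, importable on any platform
import re

# Assumed value; the snippet above only shows the name _FILENAME_SUFFIX.
_FILENAME_SUFFIX = "_parser"

# Tail of a Windows-style glob pattern: parsers\*_parser.py
glob_filename = ntpath.join("parsers", "*" + _FILENAME_SUFFIX + ".py")


def naive_regex(pattern_path):
    """Turn the glob into a regex by plain substitution, as in the traceback."""
    return re.compile(pattern_path.replace("*", r"(?P<ext>\w+)"))


def basename_regex(pattern_path):
    """Build the regex from the file name only, escaping its literal parts."""
    prefix, suffix = ntpath.basename(pattern_path).split("*")
    return re.compile(
        re.escape(prefix) + r"(?P<ext>\w+)" + re.escape(suffix) + r"\Z"
    )


def extension_of(path, pattern_path):
    """Return the parser extension encoded in a file name, or None."""
    match = basename_regex(pattern_path).match(ntpath.basename(path))
    return match.group("ext") if match else None


# The substitution yields parsers\(?P<ext>\w+)_parser.py, where "\(" is a
# literal parenthesis, so the closing ")" has no partner.
try:
    naive_regex(glob_filename)
    naive_error = None
except re.error as exc:
    naive_error = str(exc)

print(naive_error)
print(extension_of(
    r"C:\Program Files\Python27\lib\site-packages\textract\parsers\pdf_parser.py",
    glob_filename,
))
```

The first print should report an unbalanced parenthesis, and the second should print pdf. Because only the base name is matched, the directory part of the path never reaches the regex engine, so neither backslashes nor other special characters in the install location can affect the pattern.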
