DocsGPT: error during training

2sbarzqh posted 8 months ago in Other
docsgpt-worker-1 | [2023-09-11 04:53:00,787: INFO/MainProcess] Task application.app.ingest[a893c966-35c5-401c-b3ac-114e6f05e95d] received
docsgpt-worker-1 | [2023-09-11 04:53:00,841: ERROR/ForkPoolWorker-1] Task application.app.ingest[a893c966-35c5-401c-b3ac-114e6f05e95d] raised unexpected: PdfReadError('EOF marker not found')
docsgpt-worker-1 | Traceback (most recent call last):
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
docsgpt-worker-1 |     R = retval = fun(*args, **kwargs)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
docsgpt-worker-1 |     return self.run(*args, **kwargs)
docsgpt-worker-1 |   File "/app/application/app.py", line 164, in ingest
docsgpt-worker-1 |     resp = ingest_worker(self, directory, formats, name_job, filename, user)
docsgpt-worker-1 |   File "/app/application/worker.py", line 66, in ingest_worker
docsgpt-worker-1 |     exclude_hidden=exclude, file_metadata=metadata_from_filename).load_data()
docsgpt-worker-1 |   File "/app/application/parser/file/bulk.py", line 146, in load_data
docsgpt-worker-1 |     data = parser.parse_file(input_file, errors=self.errors)
docsgpt-worker-1 |   File "/app/application/parser/file/docs_parser.py", line 28, in parse_file
docsgpt-worker-1 |     pdf = PyPDF2.PdfReader(fp)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 319, in __init__
docsgpt-worker-1 |     self.read(stream)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1415, in read
docsgpt-worker-1 |     self._find_eof_marker(stream)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1471, in _find_eof_marker
docsgpt-worker-1 |     raise PdfReadError("EOF marker not found")
docsgpt-worker-1 | PyPDF2.errors.PdfReadError: EOF marker not found
docsgpt-backend-1 | [2023-09-11 04:53:05 +0000] [8] [ERROR] Error handling request /api/task_status?task_id=a893c966-35c5-401c-b3ac-114e6f05e95d
docsgpt-backend-1 | Traceback (most recent call last):
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 136, in handle
docsgpt-backend-1 |     self.handle_request(listener, req, client, addr)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
docsgpt-backend-1 |     respiter = self.wsgi(environ, resp.start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2552, in __call__
docsgpt-backend-1 |     return self.wsgi_app(environ, start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2532, in wsgi_app
docsgpt-backend-1 |     response = self.handle_exception(e)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
docsgpt-backend-1 |     response = self.full_dispatch_request()
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1826, in full_dispatch_request
docsgpt-backend-1 |     return self.finalize_request(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1845, in finalize_request
docsgpt-backend-1 |     response = self.make_response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2157, in make_response
docsgpt-backend-1 |     rv = self.json.response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 309, in response
docsgpt-backend-1 |     f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 230, in dumps
docsgpt-backend-1 |     return json.dumps(obj, **kwargs)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/__init__.py", line 238, in dumps
docsgpt-backend-1 |     **kw).encode(obj)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 201, in encode
docsgpt-backend-1 |     chunks = list(chunks)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 431, in _iterencode
docsgpt-backend-1 |     yield from _iterencode_dict(o, _current_indent_level)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
docsgpt-backend-1 |     yield from chunks
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 438, in _iterencode
docsgpt-backend-1 |     o = _default(o)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 122, in _default
docsgpt-backend-1 |     raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
docsgpt-backend-1 | TypeError: Object of type PdfReadError is not JSON serializable
docsgpt-backend-1 | [2023-09-11 04:53:05 +0000] [8] [ERROR] Error handling request /api/task_status?task_id=a893c966-35c5-401c-b3ac-114e6f05e95d
docsgpt-backend-1 | Traceback (most recent call last):
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 136, in handle
docsgpt-backend-1 |     self.handle_request(listener, req, client, addr)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
docsgpt-backend-1 |     respiter = self.wsgi(environ, resp.start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2552, in __call__
docsgpt-backend-1 |     return self.wsgi_app(environ, start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2532, in wsgi_app
docsgpt-backend-1 |     response = self.handle_exception(e)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
docsgpt-backend-1 |     response = self.full_dispatch_request()
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1826, in full_dispatch_request
docsgpt-backend-1 |     return self.finalize_request(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1845, in finalize_request
docsgpt-backend-1 |     response = self.make_response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2157, in make_response
docsgpt-backend-1 |     rv = self.json.response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 309, in response
docsgpt-backend-1 |     f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 230, in dumps
docsgpt-backend-1 |     return json.dumps(obj, **kwargs)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/__init__.py", line 238, in dumps
docsgpt-backend-1 |     **kw).encode(obj)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 201, in encode
docsgpt-backend-1 |     chunks = list(chunks)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 431, in _iterencode
docsgpt-backend-1 |     yield from _iterencode_dict(o, _current_indent_level)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
docsgpt-backend-1 |     yield from chunks
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 438, in _iterencode
docsgpt-backend-1 |     o = _default(o)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 122, in _default
docsgpt-backend-1 |     raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
docsgpt-backend-1 | TypeError: Object of type PdfReadError is not JSON serializable
docsgpt-worker-1 | [2023-09-11 04:55:44,852: INFO/MainProcess] Task application.app.ingest[0aba0fe2-a64b-4462-8f6b-ce73cd2b3da9] received
docsgpt-worker-1 | [2023-09-11 04:55:44,860: ERROR/ForkPoolWorker-1] Task application.app.ingest[0aba0fe2-a64b-4462-8f6b-ce73cd2b3da9] raised unexpected: PdfReadError('EOF marker not found')
docsgpt-worker-1 | Traceback (most recent call last):
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
docsgpt-worker-1 |     R = retval = fun(*args, **kwargs)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
docsgpt-worker-1 |     return self.run(*args, **kwargs)
docsgpt-worker-1 |   File "/app/application/app.py", line 164, in ingest
docsgpt-worker-1 |     resp = ingest_worker(self, directory, formats, name_job, filename, user)
docsgpt-worker-1 |   File "/app/application/worker.py", line 66, in ingest_worker
docsgpt-worker-1 |     exclude_hidden=exclude, file_metadata=metadata_from_filename).load_data()
docsgpt-worker-1 |   File "/app/application/parser/file/bulk.py", line 146, in load_data
docsgpt-worker-1 |     data = parser.parse_file(input_file, errors=self.errors)
docsgpt-worker-1 |   File "/app/application/parser/file/docs_parser.py", line 28, in parse_file
docsgpt-worker-1 |     pdf = PyPDF2.PdfReader(fp)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 319, in __init__
docsgpt-worker-1 |     self.read(stream)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1415, in read
docsgpt-worker-1 |     self._find_eof_marker(stream)
docsgpt-worker-1 |   File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1471, in _find_eof_marker
docsgpt-worker-1 |     raise PdfReadError("EOF marker not found")
docsgpt-worker-1 | PyPDF2.errors.PdfReadError: EOF marker not found
docsgpt-backend-1 | [2023-09-11 04:55:49 +0000] [7] [ERROR] Error handling request /api/task_status?task_id=0aba0fe2-a64b-4462-8f6b-ce73cd2b3da9
docsgpt-backend-1 | Traceback (most recent call last):
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 136, in handle
docsgpt-backend-1 |     self.handle_request(listener, req, client, addr)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
docsgpt-backend-1 |     respiter = self.wsgi(environ, resp.start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2552, in __call__
docsgpt-backend-1 |     return self.wsgi_app(environ, start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2532, in wsgi_app
docsgpt-backend-1 |     response = self.handle_exception(e)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
docsgpt-backend-1 |     response = self.full_dispatch_request()
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1826, in full_dispatch_request
docsgpt-backend-1 |     return self.finalize_request(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1845, in finalize_request
docsgpt-backend-1 |     response = self.make_response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2157, in make_response
docsgpt-backend-1 |     rv = self.json.response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 309, in response
docsgpt-backend-1 |     f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 230, in dumps
docsgpt-backend-1 |     return json.dumps(obj, **kwargs)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/__init__.py", line 238, in dumps
docsgpt-backend-1 |     **kw).encode(obj)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 201, in encode
docsgpt-backend-1 |     chunks = list(chunks)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 431, in _iterencode
docsgpt-backend-1 |     yield from _iterencode_dict(o, _current_indent_level)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
docsgpt-backend-1 |     yield from chunks
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 438, in _iterencode
docsgpt-backend-1 |     o = _default(o)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 122, in _default
docsgpt-backend-1 |     raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
docsgpt-backend-1 | TypeError: Object of type PdfReadError is not JSON serializable
docsgpt-backend-1 | [2023-09-11 04:55:50 +0000] [7] [ERROR] Error handling request /api/task_status?task_id=0aba0fe2-a64b-4462-8f6b-ce73cd2b3da9
docsgpt-backend-1 | Traceback (most recent call last):
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 136, in handle
docsgpt-backend-1 |     self.handle_request(listener, req, client, addr)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
docsgpt-backend-1 |     respiter = self.wsgi(environ, resp.start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2552, in __call__
docsgpt-backend-1 |     return self.wsgi_app(environ, start_response)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2532, in wsgi_app
docsgpt-backend-1 |     response = self.handle_exception(e)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
docsgpt-backend-1 |     response = self.full_dispatch_request()
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1826, in full_dispatch_request
docsgpt-backend-1 |     return self.finalize_request(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1845, in finalize_request
docsgpt-backend-1 |     response = self.make_response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2157, in make_response
docsgpt-backend-1 |     rv = self.json.response(rv)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 309, in response
docsgpt-backend-1 |     f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 230, in dumps
docsgpt-backend-1 |     return json.dumps(obj, **kwargs)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/__init__.py", line 238, in dumps
docsgpt-backend-1 |     **kw).encode(obj)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 201, in encode
docsgpt-backend-1 |     chunks = list(chunks)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 431, in _iterencode
docsgpt-backend-1 |     yield from _iterencode_dict(o, _current_indent_level)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
docsgpt-backend-1 |     yield from chunks
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/json/encoder.py", line 438, in _iterencode
docsgpt-backend-1 |     o = _default(o)
docsgpt-backend-1 |   File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 122, in _default
docsgpt-backend-1 |     raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
docsgpt-backend-1 | TypeError: Object of type PdfReadError is not JSON serializable
kgqe7b3p #1

It looks like PyPDF2 has a problem opening the file. Could you send me the PDF so I can take a look?
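
Before sending the file, a quick local check may help narrow things down. A PDF normally starts with a `%PDF-` header and ends with a `%%EOF` marker, and the traceback shows PyPDF2 failing in `_find_eof_marker`. The sketch below uses only the standard library; `check_pdf_bytes` and `check_pdf_file` are names made up for this example and are not part of PyPDF2 or DocsGPT, and the 1024-byte windows are an arbitrary assumption.

```python
from pathlib import Path

PDF_HEADER = b"%PDF-"
PDF_EOF = b"%%EOF"


def check_pdf_bytes(data: bytes, window: int = 1024) -> dict:
    """Report whether raw bytes look like a complete PDF file."""
    return {
        "size": len(data),
        # Some files carry a few junk bytes before the header, so search a window.
        "has_header": PDF_HEADER in data[:window],
        # The end-of-file marker is expected near the end of the file.
        "has_eof_marker": PDF_EOF in data[-window:],
    }


def check_pdf_file(path: str) -> dict:
    """Convenience wrapper (hypothetical helper): run the same check on a file on disk."""
    return check_pdf_bytes(Path(path).read_bytes())


if __name__ == "__main__":
    complete = b"%PDF-1.4\n...objects...\n%%EOF\n"
    truncated = b"%PDF-1.4\n...objects cut off"
    print(check_pdf_bytes(complete))
    print(check_pdf_bytes(truncated))
```

If `has_eof_marker` comes back false, the file is probably truncated or is not a PDF at all (for example an HTML error page saved with a `.pdf` extension). Running `check_pdf_file` on the copy inside the worker container, if you can reach it, would show whether the upload itself arrived intact.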

mnowg1ta #2

Does it contain actual text, or is it just a scan?

bpsygsoo #3

I tried several PDF files and ran into the problem with all of them.

jgovgodb #4

Same here.

docsgpt-backend-1 | TypeError: Object of type PdfReadError is not JSON serializable
docsgpt-worker-1 | [2023-09-20 12:01:29,085: INFO/MainProcess] Task application.app.ingest[2bbc7947-0467-47fb-8116-c12f49263de1] received
docsgpt-worker-1 | [2023-09-20 12:01:29,092: ERROR/ForkPoolWorker-15] Task application.app.ingest[2bbc7947-0467-47fb-8116-c12f49263de1] raised unexpected: PdfReadError('EOF marker not found')
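
The `TypeError` on the backend looks like a second, separate symptom: the `/api/task_status` response apparently contains the exception object raised by the failed task, and the JSON encoder cannot serialize it. That is a guess from the traceback and has not been checked against the DocsGPT source. A minimal sketch of one way to make such a payload serializable, using the standard `json` module's `default` hook; the payload shape, the `to_json_safe` helper, and the stand-in `PdfReadError` class are all assumptions made for illustration:

```python
import json


def to_json_safe(value):
    """Fallback for json.dumps: turn exception objects into plain dictionaries."""
    if isinstance(value, BaseException):
        return {"error_type": type(value).__name__, "message": str(value)}
    # Anything else is still unsupported, so keep the standard error.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


class PdfReadError(Exception):
    """Stand-in for PyPDF2.errors.PdfReadError so the sketch runs without PyPDF2."""


# Hypothetical task-status payload whose result is an exception object.
payload = {"status": "FAILURE", "result": PdfReadError("EOF marker not found")}

# Without the hook, json.dumps(payload) raises the TypeError seen in the log.
body = json.dumps(payload, default=to_json_safe)

if __name__ == "__main__":
    print(body)
```

With something like this in place the status endpoint would report the failure to the client instead of returning a 500, but the underlying `EOF marker not found` error would still need to be fixed separately.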
