Usage and code examples of the org.apache.tika.Tika.parseToString() method

x33g5p2x, reposted 2022-01-29, category: Other

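This article collects code examples for the Java method org.apache.tika.Tika.parseToString() and shows how Tika.parseToString() is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven artifacts, extracted from selected projects, and are intended as a practical reference. Details of the method:
Qualified class name: org.apache.tika.Tika
Class name: Tika
Method name: parseToString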

About Tika.parseToString

Parses the given file and returns the extracted text content.

To avoid unpredictable excess memory use, the returned string contains only up to the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
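
The length cap described above can be pictured as a buffer that stops accepting text once a maximum is reached. The following is a minimal plain-Java sketch of that idea only. It is not Tika's implementation, and the names `CappedTextBuffer` and `CappedTextDemo` are hypothetical, introduced here for illustration.

```java
// Plain-Java stand-in for a length-capped text buffer; NOT Tika's implementation.
final class CappedTextBuffer {
    private final int maxLength; // plays the role of getMaxStringLength()
    private final StringBuilder text = new StringBuilder();

    CappedTextBuffer(int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        this.maxLength = maxLength;
    }

    /** Appends as much of the chunk as still fits; returns false once the cap is reached. */
    boolean append(String chunk) {
        int remaining = maxLength - text.length();
        if (chunk.length() <= remaining) {
            text.append(chunk);
            return text.length() < maxLength;
        }
        // Keep only the leading part of the chunk that still fits under the cap.
        text.append(chunk, 0, remaining);
        return false;
    }

    String result() {
        return text.toString();
    }
}

class CappedTextDemo {
    public static void main(String[] args) {
        CappedTextBuffer buffer = new CappedTextBuffer(10);
        for (String chunk : new String[] {"Hello, ", "world", "!"}) {
            if (!buffer.append(chunk)) {
                break; // cap reached, ignore the rest of the input
            }
        }
        System.out.println(buffer.result()); // prints "Hello, wor"
    }
}
```

With Tika itself, the corresponding step is to call setMaxStringLength(int) on the Tika instance before calling parseToString.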

Code examples

Example source: apache/tika

  public static void main(String[] args) throws Exception {
      // Create a Tika instance with the default configuration
      Tika tika = new Tika();
      // Parse all given files and print out the extracted text content
      for (String file : args) {
          String text = tika.parseToString(new File(file));
          System.out.print(text);
      }
  }

Example source: apache/tika

  public static String parseToStringExample() throws Exception {
      File document = new File("example.doc");
      String content = new Tika().parseToString(document);
      System.out.print(content);
      return content;
  }

Example source: apache/tika

  /**
   * Example of how to use Tika's parseToString method to parse the content of a file,
   * and return any text found.
   * <p>
   * Note: Tika.parseToString() will extract content from the outer container
   * document and any embedded/attached documents.
   *
   * @return The content of a file.
   */
  public String parseToStringExample() throws IOException, SAXException, TikaException {
      Tika tika = new Tika();
      try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
          return tika.parseToString(stream);
      }
  }

Example source: apache/tika

  // tika and writer are not declared in this snippet; they are assumed to be
  // fields of the enclosing class (a Tika instance and a Lucene IndexWriter).
  public void indexDocument(File file) throws Exception {
      Document document = new Document();
      document.add(new TextField("filename", file.getName(), Store.YES));
      document.add(new TextField("fulltext", tika.parseToString(file), Store.NO));
      writer.addDocument(document);
  }

Example source: apache/tika

  /**
   * Parses the given document and returns the extracted text content.
   * The given input stream is closed by this method.
   * <p>
   * To avoid unpredictable excess memory use, the returned string contains
   * only up to {@link #getMaxStringLength()} first characters extracted
   * from the input document. Use the {@link #setMaxStringLength(int)}
   * method to adjust this limitation.
   * <p>
   * <strong>NOTE:</strong> Unlike most other Tika methods that take an
   * {@link InputStream}, this method will close the given stream for
   * you as a convenience. With other methods you are still responsible
   * for closing the stream or a wrapper instance returned by Tika.
   *
   * @param stream the document to be parsed
   * @return extracted text content
   * @throws IOException if the document can not be read
   * @throws TikaException if the document can not be parsed
   */
  public String parseToString(InputStream stream)
          throws IOException, TikaException {
      return parseToString(stream, new Metadata());
  }
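
The Javadoc above notes that parseToString(InputStream) closes the stream it is given. The sketch below illustrates what that contract means for a caller, using only the Java standard library. `readAndClose` is a hypothetical stand-in for a method with this behavior, not Tika code, and `CloseCountingStream` is a helper introduced here to make the close calls visible. It requires Java 9 or later because of `InputStream.readAllBytes()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Counts close() calls so the demo can show who closed the stream.
final class CloseCountingStream extends ByteArrayInputStream {
    int closeCalls = 0;

    CloseCountingStream(String content) {
        super(content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        closeCalls++;
        super.close();
    }
}

class StreamClosingDemo {
    // Hypothetical stand-in for a method that, like parseToString(InputStream),
    // closes the stream it was given. This is not Tika code.
    static String readAndClose(InputStream stream) throws IOException {
        try {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) throws IOException {
        CloseCountingStream stream = new CloseCountingStream("some text");
        String text;
        // The caller also uses try-with-resources, so close() runs a second time.
        try (CloseCountingStream s = stream) {
            text = readAndClose(s);
        }
        System.out.println(text + " / close() calls: " + stream.closeCalls);
        // prints "some text / close() calls: 2"
    }
}
```

The second close is harmless here because `Closeable.close()` is specified to have no effect on an already closed stream, which is why wrapping a parseToString(stream) call in try-with-resources is safe even though the method closes the stream itself.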

Example source: apache/tika

  /**
   * Parses the file at the given path and returns the extracted text content.
   * <p>
   * To avoid unpredictable excess memory use, the returned string contains
   * only up to {@link #getMaxStringLength()} first characters extracted
   * from the input document. Use the {@link #setMaxStringLength(int)}
   * method to adjust this limitation.
   *
   * @param path the path of the file to be parsed
   * @return extracted text content
   * @throws IOException if the file can not be read
   * @throws TikaException if the file can not be parsed
   */
  public String parseToString(Path path) throws IOException, TikaException {
      Metadata metadata = new Metadata();
      InputStream stream = TikaInputStream.get(path, metadata);
      return parseToString(stream, metadata);
  }

Example source: apache/tika

  /**
   * Parses the resource at the given URL and returns the extracted
   * text content.
   * <p>
   * To avoid unpredictable excess memory use, the returned string contains
   * only up to {@link #getMaxStringLength()} first characters extracted
   * from the input document. Use the {@link #setMaxStringLength(int)}
   * method to adjust this limitation.
   *
   * @param url the URL of the resource to be parsed
   * @return extracted text content
   * @throws IOException if the resource can not be read
   * @throws TikaException if the resource can not be parsed
   */
  public String parseToString(URL url) throws IOException, TikaException {
      Metadata metadata = new Metadata();
      InputStream stream = TikaInputStream.get(url, metadata);
      return parseToString(stream, metadata);
  }

Example source: rnewson/couchdb-lucene

  public void parse(final InputStream in, final String contentType, final String fieldName, final Document doc)
          throws IOException {
      final Metadata md = new Metadata();
      md.set(HttpHeaders.CONTENT_TYPE, contentType);
      try {
          // Add body text.
          doc.add(text(fieldName, tika.parseToString(in, md), false));
      } catch (final IOException e) {
          log.warn("Failed to index an attachment.", e);
          return;
      } catch (final TikaException e) {
          log.warn("Failed to parse an attachment.", e);
          return;
      }
      // Add DC attributes.
      addDublinCoreAttributes(md, doc);
  }

Example source: apache/tika

  /**
   * Parses the given file and returns the extracted text content.
   * <p>
   * To avoid unpredictable excess memory use, the returned string contains
   * only up to {@link #getMaxStringLength()} first characters extracted
   * from the input document. Use the {@link #setMaxStringLength(int)}
   * method to adjust this limitation.
   *
   * @param file the file to be parsed
   * @return extracted text content
   * @throws IOException if the file can not be read
   * @throws TikaException if the file can not be parsed
   * @see #parseToString(Path)
   */
  public String parseToString(File file) throws IOException, TikaException {
      Metadata metadata = new Metadata();
      @SuppressWarnings("deprecation")
      InputStream stream = TikaInputStream.get(file, metadata);
      return parseToString(stream, metadata);
  }

Example source: apache/tika

  public TrecDocument summarize(File file) throws FileNotFoundException,
          IOException, TikaException {
      Tika tika = new Tika();
      Metadata met = new Metadata();
      String contents = tika.parseToString(new FileInputStream(file), met);
      return new TrecDocument(met.get(TikaCoreProperties.RESOURCE_NAME_KEY), contents,
              met.getDate(TikaCoreProperties.CREATED));
  }

Example source: stackoverflow.com

  private void compareXlsx(File expected, File result) throws IOException, TikaException {
      Tika tika = new Tika();
      String expectedText = tika.parseToString(expected);
      String resultText = tika.parseToString(result);
      assertEquals(expectedText, resultText);
  }

Maven dependency used by this example:

  <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.13</version>
      <scope>test</scope>
  </dependency>

Example source: org.onehippo.cms7/hippo-cms-api

  private String doParse(final InputStream inputStream) {
      try {
          // tika parseToString already closes the inputStream
          return tika.parseToString(inputStream);
      } catch (TikaException e) {
          throw new IllegalStateException("Unexpected TikaException processing failure", e);
      } catch (IOException e) {
          throw new IllegalStateException("Unexpected IOException processing failure", e);
      }
  }

Example source: stackoverflow.com

  public String parseToStringExample() throws IOException, SAXException, TikaException
  {
      Tika tika = new Tika();
      try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
          return tika.parseToString(stream); // Returns the text of the PDF
      }
  }

Example source: stackoverflow.com

  File inputFile = ...;
  Tika tika = new Tika();
  String extractedText = tika.parseToString(inputFile);

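Example source: stackoverflow.com

  Tika tika = new Tika();
  Metadata metadata = new Metadata();
  metadata.set(Metadata.RESOURCE_NAME_KEY, "myfile.name");
  // Note: parseToString(File) creates its own Metadata internally, so the
  // metadata object prepared above is not passed to this call.
  String text = tika.parseToString(new File("myfile.name"));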

Example source: org.xwiki.platform/xwiki-platform-search-lucene-api

  private String getContentAsText(XWikiDocument doc, XWikiContext context)
  {
      String contentText = null;
      try {
          XWikiAttachment att = doc.getAttachment(this.filename);
          LOGGER.debug("Start parsing attachment [{}] in document [{}]", this.filename, doc.getDocumentReference());
          Tika tika = new Tika();
          Metadata metadata = new Metadata();
          metadata.set(Metadata.RESOURCE_NAME_KEY, this.filename);
          contentText = StringUtils.lowerCase(tika.parseToString(att.getContentInputStream(context), metadata));
      } catch (Throwable ex) {
          LOGGER.warn("error getting content of attachment [{}] for document [{}]",
              new Object[] {this.filename, doc.getDocumentReference(), ex});
      }
      return contentText;
  }
