Hadoop / generic Java

1qczuiv0 posted on 2021-06-04 in Hadoop

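My goal is to load documents such as MS Word and PDF files into HDFS, extract specific "content" from each document, and then use that content for further analysis.
I think a library like Tika could be used inside MapReduce (MR) jobs for this.
Part of the content of one Word doc is as follows: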

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage.
 Innovate upstream and downstream
1.  Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
2.  Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
3.  Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
4.  Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration

Optimum Sourcing Principles for Corrugates

<A TABLE HERE>

7.  Tactical Planning and Execution

<A TABLE HERE>

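Suppose I want to do the following:
Extract the table under "Optimum Sourcing Principles for Corrugates"
Extract the bullet points under "Innovate upstream and downstream"
Although this may look crazy and absurd, I would like to know whether Tika (I tried it, but I am stuck with only the metadata and the file content as a single string), Lucene/Solr, POI and the like can help parse and "understand" Word and PDF documents, and allow chunks of data to be extracted based on some search string (or regex).
For example, I used the Tika parser and got the output below, which is too naive to help with content extraction (the 'A TABLE HERE' placeholder, i.e. a table in the Word doc, is interpreted as paragraphs!):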

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage to P&G.
 Innovate upstream and downstream
Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration.

Optimum Sourcing Principles for Corrugates
    principle
    optimum
    rationale

    Number of  suppliers
    2-3 per plant
>80% with 5 per region/country cluster
    Competition is local
Scale the spend with central accounts

    Global/local suppliers
    Regional is sufficient
    No advantage to global as scale is regional only and there is limited IP to transfer.
Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.

    Approach to suppliers
    collaborative
    Competition to drive price is clear; preferential and value-add deals require collaboration

    Make v buy
    buy
    Multiple suppliers; commoditised technologies

    Distance of suppliers to plant
    Max 300km for boxes (300miles in NA); up to 1000km for paper reels.
Can be longer for specialist print grades or to countries with no high quality local supply
    Economic max as high volume product (air in the fluting)
Need recent built paper machines to produce paper strong enough to run on high-speed corrugators

    Type of suppliers
    Integrated with containerboard making

Corrugators on-site
    To assure supply and avoid being leveraged by paper making scale
Cost structure not competitive if have to buy in board (shipping air)

    Purchase of feedstocks
    Not if integrated suppliers
    Integrated suppliers have 20x our scale

    Length and nature of contracts
    Multiple year (2-3), but with fixed glidepath pricing/value every year
    Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.

    Specifications
    Standard board weights

Tailored box sizes
    Paper scale much higher so uneconomic to make tailored weight
Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.

    Terms
    Standard, including payment terms
    High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.
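One way to work with flat text like the output above is to cut out the lines between two known headings and then pick the numbered items with a regex. The following is a minimal sketch using only the JDK; the class name `SectionSketch`, both method names and the `SAMPLE` string are hypothetical. It assumes the item numbering ("1.", "2.") survives extraction, which holds for the original document text but not for the Tika output shown above, where the numbers were dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionSketch {

    // Shortened excerpt of the document text quoted in the question.
    static final String SAMPLE = " Innovate upstream and downstream\n"
            + "1.  Biopulp.\n"
            + "We will execute Biopulp initially in corrugate for Haircare in China.\n"
            + "2.  Mandrel Case Forming\n"
            + "We will extend the use of MCF technology within WE for the businesses that already use MCF cases.\n"
            + "\n"
            + "Supplier Strategy for Competition\n"
            + "3.  Competition in practice \n";

    /**
     * Returns the non-empty, trimmed lines after the line equal to start and
     * before the line equal to end. If end never occurs, everything up to the
     * end of the text is returned; if start never occurs, the list is empty.
     */
    static List<String> linesBetween(String text, String start, String end) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : text.split("\\R")) {
            String t = line.trim();
            if (!inside) {
                inside = t.equals(start);
                continue;
            }
            if (t.equals(end)) {
                break;
            }
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Keeps only lines of the form "<number>. <title>" and returns the titles. */
    static List<String> numberedTitles(List<String> lines) {
        Pattern item = Pattern.compile("\\d+\\.\\s+(.*)");
        List<String> titles = new ArrayList<>();
        for (String line : lines) {
            Matcher m = item.matcher(line);
            if (m.matches()) {
                titles.add(m.group(1));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<String> section = linesBetween(SAMPLE,
                "Innovate upstream and downstream",
                "Supplier Strategy for Competition");
        System.out.println(numberedTitles(section));
    }
}
```

Matching on exact heading text is brittle: any change in wording or whitespace inside the heading breaks it, so this only suits documents that follow a fixed template.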

Below is the sample Tika code I wrote (I don't know how to do the above when documents of different types (PDF, MS Word, etc.) arrive):

private void parseFileForContent(String absolutePath) throws IOException,
            SAXException, TikaException {
        // Prints the detected type; for a single file, also parses it and prints the body text.

        System.out.println("absolutePath : " + absolutePath);

        Tika tika = new Tika();

        File path = new File(absolutePath);

        if (path.isDirectory()) {

            File[] files = path.listFiles();

            for (File file : files) {

                System.out.println("File type is " + tika.detect(file));
            }
        } else {
            System.out.println("File type is " + tika.detect(path));

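            // AutoDetectParser picks the concrete parser from the detected type.
            Parser parser = new AutoDetectParser();

            // BodyContentHandler keeps only the body text. The no-arg constructor
            // stops with a SAXException after 100,000 characters.
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            // try-with-resources closes the stream once parsing finishes.
            try (TikaInputStream stream = TikaInputStream.get(path)) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }

            //displayMetadata(metadata);

            System.out.println("Handler "+handler.toString());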
        }

    }
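The `BodyContentHandler` used above keeps only the text of the XHTML events that Tika parsers emit, which is why the table comes out as loose lines. Tika also ships `org.apache.tika.sax.ToXMLContentHandler`, which captures that XHTML as a string, and for Word documents the table markup can survive there (for PDF it often cannot be recovered). The sketch below uses only the JDK and works on a hypothetical XHTML string; the class `XhtmlTableSketch`, its methods, and the exact element layout in `SAMPLE` are assumptions for illustration, not Tika's guaranteed output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlTableSketch {

    // Hypothetical XHTML, shaped like what a structure-preserving handler might capture.
    static final String SAMPLE = "<html><body>"
            + "<p>Optimum Sourcing Principles for Corrugates</p>"
            + "<table><tbody>"
            + "<tr><td>principle</td><td>optimum</td><td>rationale</td></tr>"
            + "<tr><td>Make v buy</td><td>buy</td>"
            + "<td>Multiple suppliers; commoditised technologies</td></tr>"
            + "</tbody></table>"
            + "<p>7. Tactical Planning and Execution</p>"
            + "</body></html>";

    /**
     * Returns the rows of the first table that follows an element whose
     * trimmed text equals heading, or an empty list if there is none.
     */
    static List<List<String>> tableAfter(String xhtml, String heading) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        List<List<String>> rows = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // all elements, in document order
        boolean headingSeen = false;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (!headingSeen) {
                headingSeen = el.getTextContent().trim().equals(heading);
            } else if (el.getTagName().equals("table")) {
                NodeList trs = el.getElementsByTagName("tr");
                for (int r = 0; r < trs.getLength(); r++) {
                    rows.add(cellsOf((Element) trs.item(r)));
                }
                break; // only the first table after the heading
            }
        }
        return rows;
    }

    /** Collects the text of the direct td/th children of one row. */
    static List<String> cellsOf(Element tr) {
        List<String> cells = new ArrayList<>();
        for (Node n = tr.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                String tag = ((Element) n).getTagName();
                if (tag.equals("td") || tag.equals("th")) {
                    cells.add(n.getTextContent().trim());
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        for (List<String> row : tableAfter(SAMPLE, "Optimum Sourcing Principles for Corrugates")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Nested tables are not handled here: rows of an inner table would be listed alongside those of the outer one.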

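I would like to use Tika because Apache POI is limited to MS documents, but with POI I can at least do something reasonable, such as extracting paragraphs, tables, etc.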

package com.lnt.sap.sp2.scratchpad;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class POIScratchpad {

    public static void main(String[] args) {
        // TODO Auto-generated method stub

        String absolutePath = args[0];

        POIScratchpad poiScratchpad = new POIScratchpad();

        poiScratchpad.parseMSDocuments(absolutePath);
    }

    private void parseMSDocuments(String absolutePath) {
        // try-with-resources closes both the stream and the document.
        try (FileInputStream in = new FileInputStream(absolutePath);
                XWPFDocument doc = new XWPFDocument(in)) {

            displayElements(doc);
            // displayParagraphs(doc);
            // displayTables(doc);

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void displayElements(XWPFDocument doc) {
        // Walks paragraphs and tables in the order they appear in the document body.

        Iterator<IBodyElement> bodyElementIterator = doc
                .getBodyElementsIterator();

        int cnt = 0;

        while (bodyElementIterator.hasNext()) {
            IBodyElement element = bodyElementIterator.next();

            System.out.println("**********" + cnt + "**********");

            System.out.println("Element type is " + element.getElementType());
            System.out.println("Part is : " + element.getPart());
            System.out.println("Part Type is : " + element.getPartType());
            System.out.println("Body is : " + element.getBody());
            System.out.println("element is " + element);

            System.out.println("**********");

            cnt++;
        }
    }

    private void displayParagraphs(XWPFDocument doc) {
        // TODO Auto-generated method stub
        List<XWPFParagraph> paragraphs = doc.getParagraphs();

        int cnt = 0;

        for (XWPFParagraph paragraph : paragraphs) {

            System.out.println("**********" + cnt + "**********");
            System.out.println(paragraph.getParagraphText());
            System.out.println("********************");

            cnt++;
        }
    }

    private void displayTables(XWPFDocument doc) {
        // TODO Auto-generated method stub

        Iterator<XWPFTable> tableIterator = doc.getTablesIterator();

        int cnt = 0;

        while (tableIterator.hasNext()) {

            XWPFTable table = tableIterator.next();

            System.out.println("**********" + cnt + "**********");

            List<XWPFTableRow> rows = table.getRows();

            for (XWPFTableRow row : rows) {

                List<XWPFTableCell> cells = row.getTableCells();

                for (XWPFTableCell cell : cells) {
                    System.out.println(cell.getText());
                }
            }

            System.out.println("********************");

            cnt++;
        }
    }
}
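Since POI gives structure for Word files while Tika covers many more formats, one possible way to combine them is to dispatch on the MIME type that `tika.detect(...)` reports. The sketch below shows only the dispatch step with JDK classes; `ExtractorDispatchSketch` and its methods are hypothetical, and the lambdas in `main` are placeholders standing in for real POI or Tika calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ExtractorDispatchSketch {

    // Standard MIME types for .docx and .pdf files.
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String PDF = "application/pdf";

    // Maps a MIME type to a function that turns a file path into extracted content.
    private final Map<String, Function<String, String>> byType = new HashMap<>();
    private final Function<String, String> fallback;

    ExtractorDispatchSketch(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    void register(String mimeType, Function<String, String> extractor) {
        byType.put(mimeType, extractor);
    }

    /** Uses the extractor registered for mimeType, or the fallback if there is none. */
    String extract(String mimeType, String path) {
        return byType.getOrDefault(mimeType, fallback).apply(path);
    }

    public static void main(String[] args) {
        // Placeholder extractors: a real job would call POI or Tika here.
        ExtractorDispatchSketch dispatch =
                new ExtractorDispatchSketch(path -> "plain text via Tika: " + path);
        dispatch.register(DOCX, path -> "structured content via POI: " + path);

        System.out.println(dispatch.extract(DOCX, "strategy.docx"));
        System.out.println(dispatch.extract(PDF, "strategy.pdf"));
    }
}
```

In a MapReduce job the same lookup could run inside the mapper, with the detected type of each input file as the key.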

How should I proceed? Are my assumptions wrong? Or is more information from the documents required?
