Examples of ParserContainerExtractor

org.apache.tika.extractor.ParserContainerExtractor
An implementation of ContainerExtractor powered by the regular Parser classes. This allows you to easily extract all the embedded resources from within container files, while using the normal parsers to do the work. By default the AutoDetectParser is used, to allow extraction from the widest range of containers.
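
The test excerpts on this page pass a TrackingHandler to the extractor and then inspect its filenames and mediaTypes lists, but the handler itself is never shown. A minimal sketch of what such a handler could look like follows. It is an assumption, not the Tika test code: ResourceHandler is a local stand-in for Tika's EmbeddedResourceHandler callback interface, media types are plain strings rather than MediaType objects, and the file names are made up, so the block compiles without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class TrackingHandlerSketch {

    // Hypothetical stand-in for Tika's EmbeddedResourceHandler.
    // Assumption: the real callback receives a MediaType; a String is used here.
    interface ResourceHandler {
        void handle(String filename, String mediaType, InputStream stream);
    }

    // Records every embedded resource it is told about, in call order,
    // so a test can assert on the names and types afterwards.
    static class TrackingHandler implements ResourceHandler {
        final List<String> filenames = new ArrayList<>();
        final List<String> mediaTypes = new ArrayList<>();

        @Override
        public void handle(String filename, String mediaType, InputStream stream) {
            filenames.add(filename);   // may be null for an unnamed resource
            mediaTypes.add(mediaType);
        }
    }

    public static void main(String[] args) {
        TrackingHandler handler = new TrackingHandler();
        InputStream empty = new ByteArrayInputStream(new byte[0]);

        // Simulate the callbacks an extractor would make for two resources.
        handler.handle("picture.png", "image/png", empty);
        handler.handle(null, "application/pdf", empty);

        System.out.println(handler.filenames.size());   // prints 2
        System.out.println(handler.mediaTypes.get(1));  // prints application/pdf
    }
}
```

With the real library, an instance of such a handler would be passed as the third argument of extractor.extract(stream, recurseExtractor, handler), as the excerpts do, and the assertions would then read the recorded lists.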

Examples of org.apache.tika.extractor.ParserContainerExtractor

    /**
     * For office files which don't have anything embedded in them
     */
    @Test
    public void testWithoutEmbedded() throws Exception {
       ContainerExtractor extractor = new ParserContainerExtractor();
       
       String[] files = new String[] {
             "testEXCEL.xls", "testWORD.doc", "testPPT.ppt",
             "testVISIO.vsd", "test-outlook.msg"
       };

Examples of org.apache.tika.extractor.ParserContainerExtractor

    /**
     * Office files with embedded images, but no other
     *  office files in them
     */
    @Test
    public void testEmbeddedImages() throws Exception {
       ContainerExtractor extractor = new ParserContainerExtractor();
       TrackingHandler handler;
       
       // Excel with 1 image
       handler = process("testEXCEL_1img.xls", extractor, false);
       assertEquals(1, handler.filenames.size());

Examples of org.apache.tika.extractor.ParserContainerExtractor

     *       -> excel
     *           -> image
     */
    @Test
    public void testEmbeddedOfficeFiles() throws Exception {
       ContainerExtractor extractor = new ParserContainerExtractor();
       TrackingHandler handler;
       
       
       // Excel with a word doc and a powerpoint doc, both of which have images in them
       // Without recursion, should see both documents + the images

Examples of org.apache.tika.extractor.ParserContainerExtractor

       assertEquals(TYPE_PDF, handler.mediaTypes.get(1));
    }


    @Test
    public void testEmbeddedOfficeFilesXML() throws Exception {
        ContainerExtractor extractor = new ParserContainerExtractor();
        TrackingHandler handler;


        handler = process("EmbeddedDocument.docx", extractor, false);
        assertTrue(handler.filenames.contains("Microsoft_Office_Excel_97-2003_Worksheet1.bin"));
        assertEquals(2, handler.filenames.size());

Examples of org.apache.tika.extractor.ParserContainerExtractor

        assertEquals(2, handler.filenames.size());
    }


    @Test
    public void testPowerpointImages() throws Exception {
        ContainerExtractor extractor = new ParserContainerExtractor();
        TrackingHandler handler;


        handler = process("pictures.ppt", extractor, false);
        assertTrue(handler.mediaTypes.contains(new MediaType("image", "jpeg")));
        assertTrue(handler.mediaTypes.contains(new MediaType("image", "png")));

Examples of org.apache.tika.extractor.ParserContainerExtractor

    private ContainerExtractor extractor;
    
    @Before
    public void setUp() {
        Tika tika = new Tika();
        extractor = new ParserContainerExtractor(
                tika.getParser(), tika.getDetector());
    }

Examples of org.apache.tika.extractor.ParserContainerExtractor


    @Test
    public void testEmbedded() throws Exception {
        InputStream input = FictionBookParserTest.class.getResourceAsStream("/test-documents/test.fb2");
        try {
            ContainerExtractor extractor = new ParserContainerExtractor();
            TikaInputStream stream = TikaInputStream.get(input);


            assertEquals(true, extractor.isSupported(stream));


            // Process it
            AbstractPOIContainerExtractionTest.TrackingHandler handler = new AbstractPOIContainerExtractionTest.TrackingHandler();
            extractor.extract(stream, null, handler);


            assertEquals(2, handler.filenames.size());
        } finally {
            input.close();
        }

Examples of org.apache.tika.extractor.ParserContainerExtractor

       assertTrue(needle > pdfHaystack && pdfHaystack > outerHaystack);
       
       //plagiarized from POIContainerExtractionTest.  Thank you!
       TrackingHandler tracker = new TrackingHandler();
       TikaInputStream tis = null;
       ContainerExtractor ex = new ParserContainerExtractor();
       try {
          tis = TikaInputStream.get(getResourceAsStream("/test-documents/testPDFEmbeddingAndEmbedded.docx"));
          // Check support while the stream is still open
          assertEquals(true, ex.isSupported(tis));
          ex.extract(tis, ex, tracker);
       } finally {
          // Close the stream opened here rather than an unrelated one
          if (tis != null) {
             tis.close();
          }
       }
       assertEquals(3, tracker.filenames.size());
       assertEquals(3, tracker.mediaTypes.size());
       assertEquals("image1.emf", tracker.filenames.get(0));
       assertNull(tracker.filenames.get(1));
       assertEquals("My first attachment", tracker.filenames.get(2));

Examples of org.apache.tika.extractor.ParserContainerExtractor

        //"regressiveness" exists only in Unit10.doc not in the container pdf document
        assertTrue(xml.contains("regressiveness"));


        TrackingHandler tracker = new TrackingHandler();
        TikaInputStream tis = null;
        ContainerExtractor ex = new ParserContainerExtractor();
        try {
            tis = TikaInputStream.get(
                getResourceAsStream("/test-documents/testPDF_childAttachments.pdf"));
            ex.extract(tis, ex, tracker);
        } finally {
            if (tis != null) {
                tis.close();
            }
        }

Examples of org.apache.tika.extractor.ParserContainerExtractor

    /**
     * Check the RTF and attachments are returned
     *  as expected
     */
    @Test
    public void testBodyAndAttachments() throws Exception {
       ContainerExtractor extractor = new ParserContainerExtractor();
       
       // Process it with recursing
       // Will have the message body RTF and the attachments
       TrackingHandler handler = process(file, extractor, true);
       assertEquals(6, handler.filenames.size());

