Maven and JUnit Project - Extracting Content and Metadata via Apache Tika
Last Updated: 26 Apr, 2025
In the software industry, content travels in documents of many formats, such as TXT, XLS, and PDF, and sometimes even MP4. With so many formats in use, a common way to extract content and metadata from all of them is needed. Apache Tika, a powerful and versatile content-analysis library, provides exactly that. As an introduction, let us walk through the features of Apache Tika and see how documents can be parsed to obtain their contents, their type, and more, via a sample Maven project.
Example Maven Project
Project Structure (the layout, reconstructed from the files discussed below):

apache-tika
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── SampleTikaAnalysis.java
    └── test
        ├── java
        │   └── SampleTikaWayUnitTest.java
        └── resources
            ├── exceldocument.xlsx
            ├── pdfdocument.txt
            └── worddocument.docx
First and foremost, the dependency required for Apache Tika needs to be specified in pom.xml:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>
For our project, the complete set of dependencies can be seen in
pom.xml
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>apache-tika</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>apache-tika</name>

    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>${tika.version}</version>
        </dependency>
    </dependencies>

    <properties>
        <tika.version>1.17</tika.version>
    </properties>
</project>
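One note: the unit tests shown later in this article use JUnit 4 and Hamcrest matchers. Assuming the parent POM does not already supply them, a test-scoped dependency along these lines (the version here is only illustrative) would also be needed:

XML

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>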
The heart of Apache Tika is its Parser API. While parsing documents, libraries such as Apache POI (for Microsoft Office formats) and PDFBox (for PDFs) are mostly used under the hood. The central method is:
void parse(
    InputStream inputStream,       // the input document to be parsed
    ContentHandler contentHandler, // handler that exports the parsed content in a particular form
    Metadata metadata,             // metadata properties, read and populated during parsing
    ParseContext parseContext      // for customizing the parsing process
) throws IOException, SAXException, TikaException
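As a quick, hedged illustration of this signature (the file name below is a placeholder), the following sketch parses an arbitrary document with AutoDetectParser and prints the extracted text. Note that BodyContentHandler caps extracted content at 100,000 characters by default; passing -1 removes that limit:

Java

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class QuickParseSketch {
    public static void main(String[] args) throws Exception {
        // -1 disables BodyContentHandler's default 100,000-character write limit
        BodyContentHandler contentHandler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream inputStream = new FileInputStream("anydocument.pdf")) { // placeholder file
            new AutoDetectParser().parse(inputStream, contentHandler, metadata, new ParseContext());
        }
        System.out.println(contentHandler.toString());
    }
}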
Document type detection can be done using an implementation class of the Detector interface, which exposes the following method:
MediaType detect(java.io.InputStream inputStream, Metadata metadata) throws IOException
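One practical note: the contract of detect() expects the supplied stream to support mark/reset, so the detector can inspect the leading bytes and rewind; a plain FileInputStream can simply be wrapped in a BufferedInputStream. A minimal sketch, with a placeholder file name:

Java

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class QuickDetectSketch {
    public static void main(String[] args) throws Exception {
        // BufferedInputStream provides the mark/reset support the detector relies on
        try (InputStream inputStream = new BufferedInputStream(new FileInputStream("anydocument.pdf"))) { // placeholder file
            MediaType mediaType = new DefaultDetector().detect(inputStream, new Metadata());
            System.out.println(mediaType); // e.g. application/pdf
        }
    }
}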
Tika can also detect the language of a document, and it identifies the language from the text itself, without the help of any metadata information.
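Language detection is not exercised in the sample class below, but a minimal sketch is straightforward: Tika 1.x ships a LanguageIdentifier class (deprecated in later releases in favor of the tika-langdetect module) that works on plain text alone:

Java

import org.apache.tika.language.LanguageIdentifier;

public class QuickLanguageSketch {
    public static void main(String[] args) {
        // The language is identified purely from the text; no metadata is involved
        LanguageIdentifier identifier = new LanguageIdentifier("This text is written in English.");
        System.out.println(identifier.getLanguage());         // e.g. "en"
        System.out.println(identifier.isReasonablyCertain()); // confidence flag
    }
}

With that in place, let us now cover the remaining topics via the sample project's Java files.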
SampleTikaAnalysis.java
In this program, the following operations are handled, each in both the low-level API style and the Tika facade style:
- Detecting the document type
- Extracting the content using a parser and the facade
- Extracting the metadata using a parser and the facade
Java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class SampleTikaAnalysis {

    // Detecting the document type by using the Detector interface
    public static String detectingTheDocTypeByUsingDetector(InputStream inputStream) throws IOException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(inputStream, metadata);
        return mediaType.toString();
    }

    // Detecting the document type by using the Tika facade
    public static String detectDocTypeUsingFacade(InputStream inputStream) throws IOException {
        Tika tika = new Tika();
        String mediaType = tika.detect(inputStream);
        return mediaType;
    }

    // Extracting the content by using AutoDetectParser
    public static String extractContentUsingParser(InputStream inputStream) throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(inputStream, contentHandler, metadata, context);
        return contentHandler.toString();
    }

    // Extracting the content by using the Tika facade
    public static String extractContentUsingFacade(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        String content = tika.parseToString(inputStream);
        return content;
    }

    // Extracting the metadata by using AutoDetectParser
    public static Metadata extractMetadataUsingParser(InputStream inputStream) throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(inputStream, contentHandler, metadata, context);
        return metadata;
    }

    // Extracting the metadata by using the Tika facade
    public static Metadata extractMetadataUsingFacade(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        tika.parse(inputStream, metadata);
        return metadata;
    }
}
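To see these helper methods in action outside of the unit tests, a hypothetical driver class could call them as below (the resource paths assume the test documents described in the next section):

Java

import java.io.FileInputStream;
import java.io.InputStream;

public class SampleTikaAnalysisDemo {
    public static void main(String[] args) throws Exception {
        // Tika inspects the content itself, so the misleading .txt extension does not matter
        try (InputStream inputStream = new FileInputStream("src/test/resources/pdfdocument.txt")) {
            System.out.println(SampleTikaAnalysis.detectDocTypeUsingFacade(inputStream)); // application/pdf
        }
        try (InputStream inputStream = new FileInputStream("src/test/resources/worddocument.docx")) {
            System.out.println(SampleTikaAnalysis.extractContentUsingFacade(inputStream));
        }
    }
}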
Let us test the above concepts with three documents, namely exceldocument.xlsx, pdfdocument.txt, and worddocument.docx. Note that pdfdocument.txt is actually a PDF file with a misleading extension, which lets us verify that Tika detects the type from the content rather than the file name. The documents should be placed under the src/test/resources folder so that the tests can load them from the classpath. Let us test the contents now via
SampleTikaWayUnitTest.java
Java
import static org.hamcrest.CoreMatchers.containsString;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThat;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.junit.Test;
import org.xml.sax.SAXException;

public class SampleTikaWayUnitTest {

    @Test
    public void withDetectorFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectingTheDocTypeByUsingDetector(inputStream);
        assertEquals("application/pdf", resultantMediaType);
        inputStream.close();
    }

    @Test
    public void withFacadeFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectDocTypeUsingFacade(inputStream);
        assertEquals("application/pdf", resultantMediaType);
        inputStream.close();
    }

    @Test
    public void byUsingParserAndGettingContent() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingParser(inputStream);
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
        inputStream.close();
    }

    @Test
    public void byUsingFacadeAndGettingContent() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingFacade(inputStream);
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
        inputStream.close();
    }

    @Test
    public void byUsingParserAndGettingMetadata() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadataUsingParser(inputStream);
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
        inputStream.close();
    }

    @Test
    public void byUsingFacadeAndGettingMetadata() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadataUsingFacade(inputStream);
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
        inputStream.close();
    }
}
Output of the JUnit test cases:
- Test withDetectorFindingTheResultTypeAsDocumentType -> finds the document type via the Detector implementation and asserts that the resultant media type is application/pdf.
- Test withFacadeFindingTheResultTypeAsDocumentType -> finds the document type via the Tika facade and asserts that the resultant media type is application/pdf.
- Test byUsingParserAndGettingContent -> parses the Word document available in the mentioned path, extracts its content, and asserts on the resultant text.
- Test byUsingFacadeAndGettingContent -> extracts the content of the Word document via the Tika facade and asserts on the resultant text.
- Test byUsingParserAndGettingMetadata -> parses the Excel document available in the mentioned path, retrieves its metadata, and asserts on the parser and author properties.
- Test byUsingFacadeAndGettingMetadata -> retrieves the metadata of the Excel document via the Tika facade and asserts on the parser and author properties.
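All six tests can be run from the project root with the standard Maven test phase:

mvn test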
Conclusion
Apache Tika is a versatile content-analysis library used across the software industry for many purposes: with a single dependency it can detect a document's type, extract its content and metadata, and even identify its language, through both a low-level Parser API and a convenient facade.