A simple free text search engine service.
- index - given a string, tokenize it and store it along with its tokens
- search - given a string, tokenize it and use a similarity function to rank how similar it is to previously indexed strings
- This is a very basic implementation; searching is unlikely to scale well as the number of indexed strings grows
- Moreover, as the store grows it may no longer fit in memory. This basic implementation does not address that issue
- WhitespaceTokenizer
- CharacterTokenizer
- SharedTokensSimilarity
- LengthPercentSimilarity
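To make the two tokenizers listed above more concrete, here is a minimal sketch of how they might behave. The class and method names (`TokenizerSketch`, `whitespaceTokens`, `characterTokens`) are hypothetical and not taken from the project, and treating the character tokenizer as "one token per non-whitespace character" is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; the project's real classes may differ.
class TokenizerSketch {

    // Whitespace tokenizer: split on runs of whitespace,
    // ignoring leading and trailing blanks.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Character tokenizer: one token per non-whitespace character
    // (skipping whitespace is an assumption of this sketch).
    static List<String> characterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("the cow says moo")); // [the, cow, says, moo]
        System.out.println(characterTokens("moo"));               // [m, o, o]
    }
}
```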
- Maven - Dependency Management
- Spark Java - A lightweight HTTP framework for Java
- slf4j - A logging facade that allows easier transition between logging implementations
- slf4j-log4j12 - A binding of slf4j to log4j
- Gson - A Java serialization/deserialization library to convert Java Objects into JSON and back
- Running this project requires Maven and Java (8+)
- Clone (or download) the project
- Run
mvn dependency:copy-dependencies install
in the project's folder
- Run
java -cp "target\FreeTextSearch-0.0.1-SNAPSHOT.jar;target\dependency\*" app.FreeTextSearch
- This may be slightly different on different operating systems (for example, Linux and macOS use / in paths and : as the classpath separator)
- By default, the java command in the previous step starts the server with a character tokenizer and a length percent similarity function
- You may choose to use a different tokenizer / similarity function
- To do that, follow this example:
java -cp "target\FreeTextSearch-0.0.1-SNAPSHOT.jar;target\dependency\*" app.FreeTextSearch word sharedtokens
- "word" (case doesn't matter) will start up the service with a word tokenizer (other values will start up the service with a character tokenizer)
- "sharedtokens" (case doesn't matter) will start up the service with a shared tokens similarity function (other values will start up the service with a length percent similarity function)
- The service should now be up and ready to receive HTTP requests
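The argument handling described above (case-insensitive matching, with any other value falling back to the default) can be sketched as follows. This is an illustration under assumptions, not the project's actual startup code: the class and method names are hypothetical, and the returned strings are just labels.

```java
// Hypothetical illustration of the documented command-line behaviour.
class StartupOptionsSketch {

    // First argument: "word" in any case selects the word tokenizer;
    // anything else, or no argument, selects the character tokenizer.
    static String tokenizerFor(String[] args) {
        boolean word = args.length > 0 && "word".equalsIgnoreCase(args[0]);
        return word ? "word" : "character";
    }

    // Second argument: "sharedtokens" in any case selects shared tokens;
    // anything else, or no argument, selects length percent.
    static String similarityFor(String[] args) {
        boolean shared = args.length > 1 && "sharedtokens".equalsIgnoreCase(args[1]);
        return shared ? "shared tokens" : "length percent";
    }

    public static void main(String[] args) {
        System.out.println("tokenizer: " + tokenizerFor(args));
        System.out.println("similarity: " + similarityFor(args));
    }
}
```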
Assuming you've started the service with a word tokenizer and a shared tokens similarity function, the following series of HTTP requests:
curl -d "the cow says moo" -X POST http://127.0.0.1:5000/index
curl -d "the cat and the hat" -X POST http://127.0.0.1:5000/index
curl -d "the dish ran away with the spoon" -X POST http://127.0.0.1:5000/index
curl -d "a cat ran away" -X POST http://127.0.0.1:5000/search
should yield the following output:
[
  {
    "matches": [
      {
        "matchedAgainst": "the dish ran away with the spoon",
        "similarity": 2
      },
      {
        "matchedAgainst": "the cat and the hat",
        "similarity": 1
      },
      {
        "matchedAgainst": "the cow says moo",
        "similarity": 0
      }
    ]
  }
]
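The scores in this output are consistent with counting the distinct whitespace-separated tokens that the query shares with each indexed string. The sketch below reproduces that ranking under this assumption; it is not the project's actual SharedTokensSimilarity code, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a shared-tokens ranking.
class SharedTokensSketch {

    // Distinct whitespace-separated tokens of a string.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Similarity = number of distinct tokens present in both strings
    // (an assumption that matches the example scores).
    static int sharedTokens(String a, String b) {
        Set<String> common = tokens(a);
        common.retainAll(tokens(b));
        return common.size();
    }

    // Return the indexed strings ordered from most to least similar.
    static List<String> rank(String query, List<String> indexed) {
        List<String> ranked = new ArrayList<>(indexed);
        ranked.sort((x, y) -> Integer.compare(sharedTokens(query, y), sharedTokens(query, x)));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
                "the cow says moo",
                "the cat and the hat",
                "the dish ran away with the spoon");
        String query = "a cat ran away";
        for (String candidate : rank(query, indexed)) {
            System.out.println(sharedTokens(query, candidate) + "  " + candidate);
        }
    }
}
```

With the three strings from the example, "ran" and "away" give the third string a score of 2, "cat" gives the second a score of 1, and the first shares no tokens with the query.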