FreeTextSearch

A simple free text search engine service.

Operations

index - given a string, tokenize it and store it along with its tokens
search - given a string, tokenize it and use a similarity function to rank how similar it is to previously indexed strings
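
The two operations can be sketched as below. This is a minimal illustration, not the project's code: the class SearchFlowSketch, its nested Entry and Match types, and the placeholder tokenizer and similarity in main are all hypothetical names chosen for the example. The Match field names mirror the JSON keys shown in the usage example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of the index/search flow; not the project's actual classes.
final class SearchFlowSketch {
    // One indexed entry: the original string plus the tokens derived from it.
    static final class Entry {
        final String text;
        final List<String> tokens;
        Entry(String text, List<String> tokens) {
            this.text = text;
            this.tokens = tokens;
        }
    }

    // One search result: an indexed string and its score against the query.
    static final class Match {
        final String matchedAgainst;
        final double similarity;
        Match(String matchedAgainst, double similarity) {
            this.matchedAgainst = matchedAgainst;
            this.similarity = similarity;
        }
    }

    private final Function<String, List<String>> tokenizer;
    private final ToDoubleBiFunction<List<String>, List<String>> similarity;
    private final List<Entry> store = new ArrayList<>(); // in-memory store

    SearchFlowSketch(Function<String, List<String>> tokenizer,
                     ToDoubleBiFunction<List<String>, List<String>> similarity) {
        this.tokenizer = tokenizer;
        this.similarity = similarity;
    }

    // index: tokenize the string and store it along with its tokens.
    void index(String text) {
        store.add(new Entry(text, tokenizer.apply(text)));
    }

    // search: tokenize the query, score it against every entry, highest score first.
    List<Match> search(String query) {
        List<String> queryTokens = tokenizer.apply(query);
        List<Match> matches = new ArrayList<>();
        for (Entry entry : store) {
            matches.add(new Match(entry.text, similarity.applyAsDouble(queryTokens, entry.tokens)));
        }
        matches.sort(Comparator.comparingDouble((Match m) -> m.similarity).reversed());
        return matches;
    }

    public static void main(String[] args) {
        // Placeholder tokenizer (comma split) and similarity (1.0 on identical token lists).
        SearchFlowSketch engine = new SearchFlowSketch(
                text -> Arrays.asList(text.split(",")),
                (query, doc) -> query.equals(doc) ? 1.0 : 0.0);
        engine.index("a,b");
        engine.index("c");
        for (Match m : engine.search("c")) {
            System.out.println(m.matchedAgainst + " -> " + m.similarity);
        }
    }
}
```

Every search scores the query against every stored entry, which is why search cost grows linearly with the size of the store.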

Notes

This is a very basic implementation, and searching is unlikely to scale well.
Moreover, as the store grows it may no longer fit in memory. This basic implementation does not address that issue.

Implemented tokenizers

WhitespaceTokenizer
CharacterTokenizer
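
For illustration, the two tokenizers might behave roughly as follows. This is a sketch under assumptions: the README does not say how the project treats leading whitespace, repeated spaces, or space characters, so the choices below (and the class and method names) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the two tokenizers; the project's classes may behave differently.
final class TokenizerSketch {
    // Splits on runs of whitespace; surrounding whitespace is ignored (an assumption).
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        for (String token : trimmed.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    // Emits every character, spaces included, as its own token (an assumption).
    static List<String> characterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("the cow says moo")); // [the, cow, says, moo]
        System.out.println(characterTokens("moo"));               // [m, o, o]
    }
}
```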

Implemented similarity functions

SharedTokensSimilarity
LengthPercentSimilarity
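
A shared tokens similarity can be sketched as a count of the distinct tokens two token lists have in common. This is an assumption about how SharedTokensSimilarity works, though the counts it produces agree with the scores in the usage example in this README. The class and method names are hypothetical. LengthPercentSimilarity is not sketched because this README does not define it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; assumes the score is the number of distinct shared tokens.
final class SharedTokensSketch {
    // Counts the distinct tokens that occur in both lists.
    static int sharedTokens(List<String> a, List<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(new HashSet<>(b));
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("a", "cat", "ran", "away");
        // "ran" and "away" are shared.
        System.out.println(sharedTokens(query,
                Arrays.asList("the", "dish", "ran", "away", "with", "the", "spoon"))); // 2
        // Only "cat" is shared.
        System.out.println(sharedTokens(query,
                Arrays.asList("the", "cat", "and", "the", "hat"))); // 1
        // Nothing is shared.
        System.out.println(sharedTokens(query,
                Arrays.asList("the", "cow", "says", "moo"))); // 0
    }
}
```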

Built with

Maven - Dependency management
Spark Java - A lightweight HTTP framework for Java
slf4j - A logging facade that eases switching between logging implementations
slf4j-log4j12 - A binding of slf4j to log4j
Gson - A Java serialization/deserialization library that converts Java objects to JSON and back

Getting started

1. Running this project requires Maven and Java (8+)
2. Clone (or download) the project
3. Run mvn dependency:copy-dependencies install in the project's folder
4. Run java -cp "target\FreeTextSearch-0.0.1-SNAPSHOT.jar;target\dependency\*" app.FreeTextSearch
   1. This may be slightly different on different operating systems (for example, on Linux or macOS the classpath separator is : rather than ; and paths use / rather than \)
5. By default, step 4 will run the server with a character tokenizer and a length percent similarity function
   1. You may choose to use a different tokenizer / similarity function
   2. To do that, follow this example: java -cp "target\FreeTextSearch-0.0.1-SNAPSHOT.jar;target\dependency\*" app.FreeTextSearch word sharedtokens
   3. "word" (case doesn't matter) will start up the service with a word tokenizer (other values will start up the service with a character tokenizer)
   4. "sharedtokens" (case doesn't matter) will start up the service with a shared tokens similarity function (other values will start up the service with a length percent similarity function)
6. The service should now be up and ready to receive HTTP requests
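
The argument handling described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and it assumes the tokenizer is always the first argument and the similarity function the second, as in the example command.

```java
// Hypothetical sketch of the startup argument rules; not the project's actual code.
final class StartupOptionsSketch {
    // "word" in any case selects the word tokenizer; anything else, or no argument,
    // falls back to the character tokenizer.
    static String tokenizerFor(String[] args) {
        boolean word = args.length > 0 && "word".equalsIgnoreCase(args[0]);
        return word ? "word" : "character";
    }

    // "sharedtokens" in any case selects shared tokens; anything else, or no argument,
    // falls back to length percent.
    static String similarityFor(String[] args) {
        boolean shared = args.length > 1 && "sharedtokens".equalsIgnoreCase(args[1]);
        return shared ? "sharedtokens" : "lengthpercent";
    }

    public static void main(String[] args) {
        System.out.println(tokenizerFor(args) + " / " + similarityFor(args));
    }
}
```

Run with no arguments, the sketch prints character / lengthpercent, matching the defaults described above.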

Example of using the service

Assuming you've started the service with a word tokenizer and a shared tokens similarity function, the following series of HTTP requests:

curl -d "the cow says moo" -X POST http://127.0.0.1:5000/index
curl -d "the cat and the hat" -X POST http://127.0.0.1:5000/index
curl -d "the dish ran away with the spoon" -X POST http://127.0.0.1:5000/index
curl -d "a cat ran away" -X POST http://127.0.0.1:5000/search

should yield the following output:

[
  {
    "matches": [
      {
        "matchedAgainst": "the dish ran away with the spoon",
        "similarity": 2
      },
      {
        "matchedAgainst": "the cat and the hat",
        "similarity": 1
      },
      {
        "matchedAgainst": "the cow says moo",
        "similarity": 0
      }
    ]
  }
]
