
Quicrawl

A versatile Go application that can work as both an HTTP server and a CLI tool, sharing the same core service logic.

Features

  • Dual Mode Operation: Run as HTTP server or CLI tool
  • Shared Service Layer: Common business logic used by both interfaces
  • Website Crawling: Extract HTML content from websites (JavaScript-rendered pages are not supported)
  • Parallel Link Crawling: Discover and crawl all links on a webpage in parallel
  • Echo HTTP Framework: Fast and lightweight HTTP server
  • Cobra CLI: Feature-rich command-line interface
  • Graceful Shutdown: Proper signal handling for HTTP server
  • Health Checks: Built-in health monitoring

Project Structure

├── cmd/
│   ├── server/          # HTTP server entry point
│   └── cli/             # CLI entry point
├── internal/
│   ├── service/         # Core business logic
│   ├── http/            # HTTP handlers and server
│   └── cli/             # CLI commands
├── main.go              # Unified entry point
├── go.mod
├── Makefile
└── README.md
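
The unified entry point in main.go switches between the two modes. A minimal sketch of that pattern, assuming a -mode flag as shown under Usage below (the commented-out package calls are placeholders, not the repository's actual function names):

package main

import (
	"flag"
	"log"
)

func main() {
	// Hypothetical dispatch; the real main.go may differ in details.
	mode := flag.String("mode", "server", "run mode: server or cli")
	port := flag.String("port", "8080", "HTTP port (server mode only)")
	flag.Parse()

	switch *mode {
	case "server":
		log.Printf("starting HTTP server on :%s", *port)
		// e.g. httpserver.Start(*port)      // name assumed, see internal/http
	case "cli":
		log.Println("running CLI")
		// e.g. clitool.Execute(flag.Args()) // name assumed, see internal/cli
	default:
		log.Fatalf("unknown mode: %s", *mode)
	}
}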

Installation

  1. Clone the repository:
git clone <repository-url>
cd quicrawl
  2. Install dependencies:
make deps

Usage

HTTP Server Mode

Start the HTTP server:

# Using unified binary
go run main.go -mode=server -port=8080

# Or using dedicated server binary
make run-server

The server will start on port 8080 (configurable) with the following endpoints:

  • GET / - API information
  • POST /api/v1/process - Process input
  • POST /api/v1/crawl - Crawl website and extract HTML
  • POST /api/v1/crawl-parallel - Crawl website and all its links in parallel (with pagination)
  • GET /api/v1/config - Get service configuration and limits
  • GET /api/v1/status - Get service status
  • GET /api/v1/health - Health check
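
The Graceful Shutdown feature generally follows the standard Echo pattern: start the server in a goroutine, wait for SIGINT or SIGTERM, then shut down with a timeout. A sketch of that pattern, not the repository's exact code:

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/labstack/echo/v4"
)

func main() {
	e := echo.New()
	e.GET("/api/v1/health", func(c echo.Context) error {
		return c.JSON(http.StatusOK, map[string]string{"status": "ok"})
	})

	// Run the server in the background so main can wait for signals.
	go func() {
		if err := e.Start(":8080"); err != nil && err != http.ErrServerClosed {
			e.Logger.Fatal(err)
		}
	}()

	// Block until an interrupt or termination signal arrives.
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit

	// Give in-flight requests up to 10 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := e.Shutdown(ctx); err != nil {
		e.Logger.Fatal(err)
	}
}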

Example API Usage

Process input:

curl -X POST http://localhost:8080/api/v1/process \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello World"}'

Crawl website:

curl -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Crawl website and all links in parallel:

curl -X POST http://localhost:8080/api/v1/crawl-parallel \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_concurrency": 5}'

Crawl with pagination:

curl -X POST http://localhost:8080/api/v1/crawl-parallel \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "page": 1, "page_size": 5}'

Get service configuration:

curl http://localhost:8080/api/v1/config

Get status:

curl http://localhost:8080/api/v1/status

Health check:

curl http://localhost:8080/api/v1/health
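
The endpoints can also be called from Go instead of curl. A small client sketch for the crawl endpoint; the request field matches the examples above, and the response is printed as raw JSON because its exact shape is not documented here:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Same request body as the curl example above.
	body, _ := json.Marshal(map[string]string{"url": "https://example.com"})

	resp, err := http.Post(
		"http://localhost:8080/api/v1/crawl",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print the raw response; the JSON structure is defined by the service.
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(raw))
}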

CLI Mode

Use the CLI tool:

# Using unified binary
go run main.go -mode=cli process "Hello World"
go run main.go -mode=cli status
go run main.go -mode=cli health

# Or using dedicated CLI binary
make run-cli

Available CLI Commands

  • process [input] - Process input using the service
  • crawl [url] - Crawl website and extract HTML content
  • crawl-parallel [url] - Crawl website and all its links in parallel
  • status - Get service status and uptime
  • health - Perform health check
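
Commands like these are typically registered with Cobra. A sketch of how the crawl command and its --output flag might be wired up (the crawl logic below is a placeholder for the shared service call, not the repository's implementation):

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var output string

	crawlCmd := &cobra.Command{
		Use:   "crawl [url]",
		Short: "Crawl website and extract HTML content",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// Placeholder: the real command delegates to the shared service layer.
			html := fmt.Sprintf("<!-- crawled %s -->", args[0])
			if output != "" {
				return os.WriteFile(output, []byte(html), 0o644)
			}
			fmt.Println(html)
			return nil
		},
	}
	crawlCmd.Flags().StringVar(&output, "output", "", "write HTML to a file instead of stdout")

	root := &cobra.Command{Use: "quicrawl"}
	root.AddCommand(crawlCmd)
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}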

CLI Examples

Crawl a website:

go run cmd/cli/main.go crawl "https://example.com"

Crawl and save to file:

go run cmd/cli/main.go crawl "https://example.com" --output page.html

Crawl website and all links in parallel:

go run cmd/cli/main.go crawl-parallel "https://example.com" --concurrency 5

Crawl parallel and save detailed results:

go run cmd/cli/main.go crawl-parallel "https://example.com" --concurrency 5 --output results.txt

Building

Build all binaries:

make build

Build specific binaries:

make build-server    # HTTP server only
make build-cli       # CLI only
make build-unified   # Unified binary

API Testing

Postman Collection

A complete Postman collection is included for easy API testing:

  1. Import the collection: Quicrawl_API.postman_collection.json
  2. Import the environment: Quicrawl_Environment.postman_environment.json
  3. Set the environment: Select "Quicrawl Environment" in Postman
  4. Start the server: go run cmd/server/main.go
  5. Run requests: All endpoints are pre-configured with example requests

The collection includes:

  • API Info endpoint
  • Process Input endpoint with examples
  • Crawl Website endpoint with test URLs
  • Crawl Website Parallel endpoint with concurrency control and pagination
  • Configuration endpoint to check service limits
  • Service Status endpoint
  • Health Check endpoint
  • Example responses for success and error cases

Environment Variables

The environment file defines the variables referenced by the collection's requests.

Response Size Limits

To prevent "Maximum response size reached" errors, the service implements several limits:

  • Max Response Size: 10MB total response size
  • Max Links to Crawl: 20 links per parallel crawl
  • Max HTML Size: 1MB per individual page
  • Max Concurrency: 10 concurrent requests (configurable)

When limits are exceeded:

  • HTTP API: The response is automatically truncated and returns metadata only
  • CLI: Shows a warning and suggests using the --output flag for full results
  • Pagination: Use page and page_size parameters to control response size
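
A common way to enforce the Max Concurrency cap described above is a buffered-channel semaphore around the per-link fetches. A sketch of that technique, not the repository's implementation:

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchAll downloads every URL with at most maxConcurrency requests in flight.
func fetchAll(urls []string, maxConcurrency int) map[string]int {
	sem := make(chan struct{}, maxConcurrency) // buffered-channel semaphore
	var mu sync.Mutex
	var wg sync.WaitGroup
	sizes := make(map[string]int)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done

			resp, err := http.Get(u)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)

			mu.Lock()
			sizes[u] = len(body)
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return sizes
}

func main() {
	pages := fetchAll([]string{"https://example.com"}, 5)
	fmt.Println(pages)
}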

Development

Run tests:

make test

Clean build artifacts:

make clean

Architecture

The application follows a clean architecture pattern:

  1. Service Layer (internal/service/): Contains the core business logic
  2. HTTP Layer (internal/http/): HTTP handlers and server using Echo framework
  3. CLI Layer (internal/cli/): Command-line interface using Cobra
  4. Entry Points: Multiple entry points for different use cases

Both the HTTP server and CLI use the same service layer, ensuring consistent behavior across interfaces.
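
In code, that sharing usually amounts to both layers holding the same service value. A sketch under assumed names (the actual internal/service API is not shown in this README):

package main

import (
	"fmt"
	"strings"
)

// Service stands in for the shared logic in internal/service (name assumed).
type Service struct{}

// Process is a placeholder for the real business logic.
func (s *Service) Process(input string) string {
	return strings.ToUpper(input)
}

// Both interfaces receive the same *Service, so behavior stays identical.
func runHTTP(s *Service) { fmt.Println("http:", s.Process("hello")) }
func runCLI(s *Service)  { fmt.Println("cli:", s.Process("hello")) }

func main() {
	svc := &Service{}
	runHTTP(svc)
	runCLI(svc)
}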

Dependencies

  • Echo - HTTP web framework
  • Cobra - CLI framework
  • Go 1.21+

About

WIP | LLM-friendly crawler that returns whole-site content in a single request
