
Quicrawl

A versatile Go application that can work as both an HTTP server and a CLI tool, sharing the same core service logic.

Features

  • Dual Mode Operation: Run as HTTP server or CLI tool
  • Shared Service Layer: Common business logic used by both interfaces
  • Website Crawling: Extract HTML content from websites (JavaScript-rendered pages are not supported)
  • Parallel Link Crawling: Discover and crawl all links on a webpage in parallel
  • Echo HTTP Framework: Fast and lightweight HTTP server
  • Cobra CLI: Feature-rich command-line interface
  • Graceful Shutdown: Proper signal handling for HTTP server
  • Health Checks: Built-in health monitoring

Project Structure

├── cmd/
│   ├── server/          # HTTP server entry point
│   └── cli/             # CLI entry point
├── internal/
│   ├── service/         # Core business logic
│   ├── http/            # HTTP handlers and server
│   └── cli/             # CLI commands
├── main.go              # Unified entry point
├── go.mod
├── Makefile
└── README.md
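
The unified entry point in main.go switches between the two modes. A minimal sketch of that pattern, assuming a -mode flag as shown under Usage below (the commented-out package calls are placeholders, not the repository's actual function names):

package main

import (
	"flag"
	"log"
)

func main() {
	// Hypothetical dispatch; the real main.go may differ in details.
	mode := flag.String("mode", "server", "run mode: server or cli")
	port := flag.String("port", "8080", "HTTP port (server mode only)")
	flag.Parse()

	switch *mode {
	case "server":
		log.Printf("starting HTTP server on :%s", *port)
		// e.g. httpserver.Start(*port)      // name assumed, see internal/http
	case "cli":
		log.Println("running CLI")
		// e.g. clitool.Execute(flag.Args()) // name assumed, see internal/cli
	default:
		log.Fatalf("unknown mode: %s", *mode)
	}
}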

Installation

  1. Clone the repository:
git clone <repository-url>
cd quicrawl
  2. Install dependencies:
make deps

Usage

HTTP Server Mode

Start the HTTP server:

# Using unified binary
go run main.go -mode=server -port=8080

# Or using dedicated server binary
make run-server

The server will start on port 8080 (configurable) with the following endpoints:

  • GET / - API information
  • POST /api/v1/process - Process input
  • POST /api/v1/crawl - Crawl website and extract HTML
  • POST /api/v1/crawl-parallel - Crawl website and all its links in parallel (with pagination)
  • GET /api/v1/config - Get service configuration and limits
  • GET /api/v1/status - Get service status
  • GET /api/v1/health - Health check
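
The Graceful Shutdown feature generally follows the standard Echo pattern: start the server in a goroutine, wait for SIGINT or SIGTERM, then shut down with a timeout. A sketch of that pattern, not the repository's exact code:

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/labstack/echo/v4"
)

func main() {
	e := echo.New()
	e.GET("/api/v1/health", func(c echo.Context) error {
		return c.JSON(http.StatusOK, map[string]string{"status": "ok"})
	})

	// Run the server in the background so main can wait for signals.
	go func() {
		if err := e.Start(":8080"); err != nil && err != http.ErrServerClosed {
			e.Logger.Fatal(err)
		}
	}()

	// Block until an interrupt or termination signal arrives.
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit

	// Give in-flight requests up to 10 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := e.Shutdown(ctx); err != nil {
		e.Logger.Fatal(err)
	}
}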

Example API Usage

Process input:

curl -X POST http://localhost:8080/api/v1/process \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello World"}'

Crawl website:

curl -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Crawl website and all links in parallel:

curl -X POST http://localhost:8080/api/v1/crawl-parallel \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_concurrency": 5}'

Crawl with pagination:

curl -X POST http://localhost:8080/api/v1/crawl-parallel \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "page": 1, "page_size": 5}'

Get service configuration:

curl http://localhost:8080/api/v1/config

Get status:

curl http://localhost:8080/api/v1/status

Health check:

curl http://localhost:8080/api/v1/health
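
The endpoints can also be called from Go instead of curl. A small client sketch for the crawl endpoint; the request field matches the examples above, and the response is printed as raw JSON because its exact shape is not documented here:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Same request body as the curl example above.
	body, _ := json.Marshal(map[string]string{"url": "https://example.com"})

	resp, err := http.Post(
		"http://localhost:8080/api/v1/crawl",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print the raw response; the JSON structure is defined by the service.
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(raw))
}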

CLI Mode

Use the CLI tool:

# Using unified binary
go run main.go -mode=cli process "Hello World"
go run main.go -mode=cli status
go run main.go -mode=cli health

# Or using dedicated CLI binary
make run-cli

Available CLI Commands

  • process [input] - Process input using the service
  • crawl [url] - Crawl website and extract HTML content
  • crawl-parallel [url] - Crawl website and all its links in parallel
  • status - Get service status and uptime
  • health - Perform health check
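
Commands like these are typically registered with Cobra. A sketch of how the crawl command and its --output flag might be wired up (the crawl logic below is a placeholder for the shared service call, not the repository's implementation):

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var output string

	crawlCmd := &cobra.Command{
		Use:   "crawl [url]",
		Short: "Crawl website and extract HTML content",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// Placeholder: the real command delegates to the shared service layer.
			html := fmt.Sprintf("<!-- crawled %s -->", args[0])
			if output != "" {
				return os.WriteFile(output, []byte(html), 0o644)
			}
			fmt.Println(html)
			return nil
		},
	}
	crawlCmd.Flags().StringVar(&output, "output", "", "write HTML to a file instead of stdout")

	root := &cobra.Command{Use: "quicrawl"}
	root.AddCommand(crawlCmd)
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}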

CLI Examples

Crawl a website:

go run cmd/cli/main.go crawl "https://example.com"

Crawl and save to file:

go run cmd/cli/main.go crawl "https://example.com" --output page.html

Crawl website and all links in parallel:

go run cmd/cli/main.go crawl-parallel "https://example.com" --concurrency 5

Crawl parallel and save detailed results:

go run cmd/cli/main.go crawl-parallel "https://example.com" --concurrency 5 --output results.txt

Building

Build all binaries:

make build

Build specific binaries:

make build-server    # HTTP server only
make build-cli       # CLI only
make build-unified   # Unified binary

API Testing

Postman Collection

A complete Postman collection is included for easy API testing:

  1. Import the collection: Quicrawl_API.postman_collection.json
  2. Import the environment: Quicrawl_Environment.postman_environment.json
  3. Set the environment: Select "Quicrawl Environment" in Postman
  4. Start the server: go run cmd/server/main.go
  5. Run requests: All endpoints are pre-configured with example requests

The collection includes:

  • API Info endpoint
  • Process Input endpoint with examples
  • Crawl Website endpoint with test URLs
  • Crawl Website Parallel endpoint with concurrency control and pagination
  • Configuration endpoint to check service limits
  • Service Status endpoint
  • Health Check endpoint
  • Example responses for success and error cases

Environment Variables

The environment file defines the variables referenced by the collection's requests.

Response Size Limits

To prevent "Maximum response size reached" errors, the service implements several limits:

  • Max Response Size: 10MB total response size
  • Max Links to Crawl: 20 links per parallel crawl
  • Max HTML Size: 1MB per individual page
  • Max Concurrency: 10 concurrent requests (configurable)

When limits are exceeded:

  • HTTP API: The response is automatically truncated and returns metadata only
  • CLI: Shows a warning and suggests using the --output flag for full results
  • Pagination: Use page and page_size parameters to control response size
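
A common way to enforce the Max Concurrency cap described above is a buffered-channel semaphore around the per-link fetches. A sketch of that technique, not the repository's implementation:

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// fetchAll downloads every URL with at most maxConcurrency requests in flight.
func fetchAll(urls []string, maxConcurrency int) map[string]int {
	sem := make(chan struct{}, maxConcurrency) // buffered-channel semaphore
	var mu sync.Mutex
	var wg sync.WaitGroup
	sizes := make(map[string]int)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done

			resp, err := http.Get(u)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)

			mu.Lock()
			sizes[u] = len(body)
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return sizes
}

func main() {
	pages := fetchAll([]string{"https://example.com"}, 5)
	fmt.Println(pages)
}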

Development

Run tests:

make test

Clean build artifacts:

make clean

Architecture

The application follows a clean architecture pattern:

  1. Service Layer (internal/service/): Contains the core business logic
  2. HTTP Layer (internal/http/): HTTP handlers and server using Echo framework
  3. CLI Layer (internal/cli/): Command-line interface using Cobra
  4. Entry Points: Multiple entry points for different use cases

Both the HTTP server and CLI use the same service layer, ensuring consistent behavior across interfaces.
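
In code, that sharing usually amounts to both layers holding the same service value. A sketch under assumed names (the actual internal/service API is not shown in this README):

package main

import (
	"fmt"
	"strings"
)

// Service stands in for the shared logic in internal/service (name assumed).
type Service struct{}

// Process is a placeholder for the real business logic.
func (s *Service) Process(input string) string {
	return strings.ToUpper(input)
}

// Both interfaces receive the same *Service, so behavior stays identical.
func runHTTP(s *Service) { fmt.Println("http:", s.Process("hello")) }
func runCLI(s *Service)  { fmt.Println("cli:", s.Process("hello")) }

func main() {
	svc := &Service{}
	runHTTP(svc)
	runCLI(svc)
}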

Dependencies

  • Echo - HTTP web framework
  • Cobra - CLI framework
  • Go 1.21+

About

WIP | LLM-friendly crawler that returns whole-site content in a single request
