A versatile Go application that can work as both an HTTP server and a CLI tool, sharing the same core service logic.
- Dual-Mode Operation: Run as an HTTP server or a CLI tool
- Shared Service Layer: Common business logic used by both interfaces
- Website Crawling: Extract HTML content from websites (non-JavaScript sites)
- Parallel Link Crawling: Discover and crawl all links on a webpage in parallel
- Echo HTTP Framework: Fast and lightweight HTTP server
- Cobra CLI: Feature-rich command-line interface
- Graceful Shutdown: Proper signal handling for the HTTP server (see the sketch after this list)
- Health Checks: Built-in health monitoring
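The graceful shutdown mentioned above follows the standard Echo pattern: run the server in a goroutine, wait for a signal, then drain in-flight requests. A minimal, self-contained sketch (the handler here is a stand-in; the real wiring lives in `internal/http/`):

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/labstack/echo/v4"
)

func main() {
	e := echo.New()
	e.GET("/api/v1/health", func(c echo.Context) error {
		return c.JSON(http.StatusOK, map[string]string{"status": "ok"})
	})

	// Run the server in a goroutine so the main goroutine can wait on signals.
	go func() {
		if err := e.Start(":8080"); err != nil && err != http.ErrServerClosed {
			e.Logger.Fatal(err)
		}
	}()

	// Block until SIGINT or SIGTERM arrives.
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit

	// Give in-flight requests up to 10 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := e.Shutdown(ctx); err != nil {
		e.Logger.Fatal(err)
	}
}
```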
```
├── cmd/
│   ├── server/      # HTTP server entry point
│   └── cli/         # CLI entry point
├── internal/
│   ├── service/     # Core business logic
│   ├── http/        # HTTP handlers and server
│   └── cli/         # CLI commands
├── main.go          # Unified entry point
├── go.mod
├── Makefile
└── README.md
```
- Clone the repository:

```bash
git clone <repository-url>
cd quicrawl
```

- Install dependencies:

```bash
make deps
```

Start the HTTP server:
```bash
# Using unified binary
go run main.go -mode=server -port=8080

# Or using dedicated server binary
make run-server
```
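The unified binary's mode switch can be as simple as a flag dispatch. A hypothetical sketch of `main.go` (the `runServer`/`runCLI` stubs below stand in for the real `internal/http` and `internal/cli` packages):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Stand-ins for the real internal/http and internal/cli packages.
func runServer(port int) { fmt.Printf("starting HTTP server on :%d\n", port) }

func runCLI(args []string) { fmt.Println("running CLI with args:", args) }

func main() {
	mode := flag.String("mode", "server", "run mode: server or cli")
	port := flag.Int("port", 8080, "HTTP port (server mode only)")
	flag.Parse()

	switch *mode {
	case "server":
		runServer(*port)
	case "cli":
		runCLI(flag.Args()) // remaining args go to the Cobra root command
	default:
		fmt.Fprintf(os.Stderr, "unknown mode %q\n", *mode)
		os.Exit(1)
	}
}
```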
The server will start on port 8080 (configurable) with the following endpoints:

- `GET /` - API information
- `POST /api/v1/process` - Process input
- `POST /api/v1/crawl` - Crawl a website and extract its HTML
- `POST /api/v1/crawl-parallel` - Crawl a website and all of its links in parallel (with pagination)
- `GET /api/v1/config` - Get service configuration and limits
- `GET /api/v1/status` - Get service status
- `GET /api/v1/health` - Health check
Process input:

```bash
curl -X POST http://localhost:8080/api/v1/process \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello World"}'
```

Crawl a website:

```bash
curl -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

Crawl a website and all of its links in parallel:

```bash
curl -X POST http://localhost:8080/api/v1/crawl-parallel \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "max_concurrency": 5}'
```

Crawl with pagination:

```bash
curl -X POST http://localhost:8080/api/v1/crawl-parallel \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "page": 1, "page_size": 5}'
```
Get the service configuration:

```bash
curl http://localhost:8080/api/v1/config
```

Get the service status:

```bash
curl http://localhost:8080/api/v1/status
```

Health check:

```bash
curl http://localhost:8080/api/v1/health
```

Use the CLI tool:
```bash
# Using unified binary
go run main.go -mode=cli process "Hello World"
go run main.go -mode=cli status
go run main.go -mode=cli health

# Or using dedicated CLI binary
make run-cli
```

Available commands:

- `process [input]` - Process input using the service
- `crawl [url]` - Crawl a website and extract its HTML content
- `crawl-parallel [url]` - Crawl a website and all of its links in parallel
- `status` - Get service status and uptime
- `health` - Perform a health check
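The commands above follow the usual Cobra shape. A minimal sketch of how the `crawl` command and its `--output` flag can be wired up (the real commands live in `internal/cli/` and call into the shared service layer):

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var output string

	crawlCmd := &cobra.Command{
		Use:   "crawl [url]",
		Short: "Crawl a website and extract its HTML content",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// The real command delegates to the shared service layer here.
			fmt.Printf("crawling %s (output: %q)\n", args[0], output)
			return nil
		},
	}
	crawlCmd.Flags().StringVar(&output, "output", "", "write results to a file")

	root := &cobra.Command{Use: "quicrawl"}
	root.AddCommand(crawlCmd)
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}
```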
Crawl a website:

```bash
go run cmd/cli/main.go crawl "https://example.com"
```

Crawl and save to a file:

```bash
go run cmd/cli/main.go crawl "https://example.com" --output page.html
```

Crawl a website and all of its links in parallel:

```bash
go run cmd/cli/main.go crawl-parallel "https://example.com" --concurrency 5
```

Crawl in parallel and save detailed results:

```bash
go run cmd/cli/main.go crawl-parallel "https://example.com" --concurrency 5 --output results.txt
```
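Under the hood, link discovery on a non-JavaScript page reduces to parsing anchor tags. A minimal sketch using `golang.org/x/net/html` (the real crawler presumably also resolves relative URLs and applies the link cap described in the limits section further down):

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// extractLinks fetches a page and collects the href of every <a> tag.
func extractLinks(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					links = append(links, a.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links, nil
}

func main() {
	links, err := extractLinks("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(links)
}
```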
Build all binaries:

```bash
make build
```

Build specific binaries:

```bash
make build-server   # HTTP server only
make build-cli      # CLI only
make build-unified  # Unified binary
```

A complete Postman collection is included for easy API testing:
- Import the collection: `Quicrawl_API.postman_collection.json`
- Import the environment: `Quicrawl_Environment.postman_environment.json`
- Set the environment: select "Quicrawl Environment" in Postman
- Start the server: `go run cmd/server/main.go`
- Run the requests: all endpoints are pre-configured with example requests
The collection includes:
- API Info endpoint
- Process Input endpoint with examples
- Crawl Website endpoint with test URLs
- Crawl Website Parallel endpoint with concurrency control and pagination
- Configuration endpoint to check service limits
- Service Status endpoint
- Health Check endpoint
- Example responses for success and error cases
The Postman environment includes:
- `base_url`: Server URL (default: http://localhost:8080)
- `test_url`: Test URL for crawling (https://httpbin.org/html)
- `example_url`: Example URL for testing (https://example.com)
- `parallel_test_url`: Test URL for parallel crawling (https://httpbin.org/links/5/0)
To prevent "Maximum response size reached" errors, the service implements several limits:
- Max Response Size: 10MB total response size
- Max Links to Crawl: 20 links per parallel crawl
- Max HTML Size: 1MB per individual page
- Max Concurrency: 10 concurrent requests (configurable)
When limits are exceeded:

- HTTP API: the response is automatically truncated, returning metadata only
- CLI: a warning is shown, suggesting the `--output` flag for full results
- Pagination: use the `page` and `page_size` parameters to control the response size
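One way these limits can be enforced is a counting semaphore for the concurrency cap plus `io.LimitReader` for the per-page cap. A sketch under those assumptions (not the actual implementation in `internal/service/`):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

const (
	maxHTMLSize    = 1 << 20 // 1MB per page, as listed above
	maxConcurrency = 10      // upper bound on concurrent requests
)

// crawlAll fetches each URL concurrently, capping both the number of
// in-flight requests and the bytes read from any single page.
func crawlAll(urls []string) map[string]string {
	results := make(map[string]string, len(urls))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxConcurrency) // counting semaphore

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it

			resp, err := http.Get(u)
			if err != nil {
				return
			}
			defer resp.Body.Close()

			// Read at most maxHTMLSize bytes so one huge page cannot
			// blow past the overall response size budget.
			body, err := io.ReadAll(io.LimitReader(resp.Body, maxHTMLSize))
			if err != nil {
				return
			}
			mu.Lock()
			results[u] = string(body)
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}

func main() {
	pages := crawlAll([]string{"https://example.com"})
	fmt.Println("crawled", len(pages), "pages")
}
```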
Run the tests:

```bash
make test
```
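Because the handlers are thin wrappers over the service layer, they can be exercised at the HTTP level with `httptest`. A sketch with a stand-in handler (the real tests target `internal/http/`):

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/labstack/echo/v4"
)

func TestHealth(t *testing.T) {
	e := echo.New()
	e.GET("/api/v1/health", func(c echo.Context) error {
		return c.JSON(http.StatusOK, map[string]string{"status": "ok"})
	})

	req := httptest.NewRequest(http.MethodGet, "/api/v1/health", nil)
	rec := httptest.NewRecorder()
	e.ServeHTTP(rec, req) // Echo implements http.Handler

	if rec.Code != http.StatusOK {
		t.Fatalf("expected 200, got %d", rec.Code)
	}
}
```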
Clean build artifacts:

```bash
make clean
```

The application follows a clean architecture pattern:
- Service Layer (`internal/service/`): contains the core business logic
- HTTP Layer (`internal/http/`): HTTP handlers and server using the Echo framework
- CLI Layer (`internal/cli/`): command-line interface using Cobra
- Entry Points: multiple entry points for different use cases
Both the HTTP server and CLI use the same service layer, ensuring consistent behavior across interfaces.
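As a compressed illustration of that pattern, both interfaces can be thin adapters over a single service type; all names below are illustrative rather than the project's actual API:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/labstack/echo/v4"
	"github.com/spf13/cobra"
)

// Service holds the core business logic shared by both interfaces.
type Service struct{}

func (s *Service) Process(input string) string { return "processed: " + input }

// newServer adapts the service to HTTP: the handler only binds the
// request and delegates to the service.
func newServer(svc *Service) *echo.Echo {
	e := echo.New()
	e.POST("/api/v1/process", func(c echo.Context) error {
		var req struct {
			Input string `json:"input"`
		}
		if err := c.Bind(&req); err != nil {
			return err
		}
		return c.JSON(http.StatusOK, map[string]string{"result": svc.Process(req.Input)})
	})
	return e
}

// newProcessCmd adapts the same service to the CLI.
func newProcessCmd(svc *Service) *cobra.Command {
	return &cobra.Command{
		Use:  "process [input]",
		Args: cobra.ExactArgs(1),
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Println(svc.Process(args[0]))
		},
	}
}

func main() {
	svc := &Service{}
	_ = newProcessCmd(svc) // registered on the Cobra root in CLI mode
	e := newServer(svc)
	e.Logger.Fatal(e.Start(":8080")) // started in server mode
}
```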