A scalable Python application for scraping job data from career sites using Jina.ai's Reader API. The application saves job listings in Markdown format with AWS Bedrock-compatible metadata, ready for S3 deployment.
- Site-Specific Scrapers: Currently supports Paysera, Revolut, and DOU.UA (Ukrainian job board with 324 companies)
- Jina.ai Integration: Leverages Jina.ai's powerful Reader API for reliable content extraction
- Dynamic Content Support: Handles JavaScript-rendered pages (e.g., Revolut's "Show more" functionality)
- AWS Bedrock Ready: Generates markdown files with Bedrock-compatible frontmatter (max 10 attributes)
- Markdown Output: Clean, readable Markdown files for each job listing with YAML frontmatter
- Concurrent Processing: Configurable parallel job processing (default: 10 concurrent jobs)
- Async/Await Architecture: High-performance async processing with semaphore-based rate limiting
- S3 Deployment: Native AWS S3 integration with sync and dry-run modes
- Extensible Design: SOLID principles and clean architecture for easy extension
- Code Quality: Automated linting with Ruff, type checking with mypy, security scanning with Bandit
- Pre-commit Hooks: Automatic code quality checks before each commit
The application follows SOLID principles with a clean, modular architecture:
talent8/
├── src/
│ ├── config/ # Configuration management
│ ├── core/ # Core interfaces and exceptions
│ ├── scrapers/
│ │ ├── site_scrapers/ # Site-specific scrapers (Revolut, Paysera)
│ │ ├── jina_reader.py # Jina.ai API integration
│ │ └── scraper_factory.py # Factory for creating scrapers
│ ├── parsers/ # Job data extraction
│ │ ├── base.py # Base parser with common methods
│ │ ├── revolut_parser.py # Revolut-specific parsing
│ │ └── paysera_parser.py # Paysera-specific parsing
│ ├── models/ # Data models (Pydantic)
│ ├── storage/ # File storage and metadata
│ └── utils/ # Utilities (logging, validation)
├── tests/ # Integration tests
├── .pre-commit-config.yaml # Pre-commit hook configuration
└── pyproject.toml # Poetry config + tool settings
- Python 3.11 or higher
- Poetry (for dependency management)
- Clone the repository:
git clone https://github.com/yourusername/talent8.git
cd talent8

- Install dependencies with Poetry:
poetry install

- Copy the environment configuration:
cp .env.example .env

- Get a Jina.ai API key (required for tests, optional for basic usage):
- Visit https://jina.ai
- Sign up for a free account
- Add your API key to `.env`:

JINA_API_KEY=your_key_here

- Without API key: 20 requests/minute (free tier)
- With API key: 200 requests/minute
- (Optional) Set up pre-commit hooks:

poetry run pre-commit install

Configure environment variables in `.env`:

# Jina.ai Configuration
JINA_API_KEY=your_api_key_here # Optional, for higher rate limits
JINA_RATE_LIMIT=20 # 20 for free tier, 200 with API key
# Storage Configuration
OUTPUT_DIR=output # Where to save job files
INCLUDE_METADATA=false # Deprecated: metadata now in frontmatter only
# AWS S3 Configuration (for deployment)
AWS_S3_BUCKET= # S3 bucket name (required for --s3-deploy)
AWS_S3_PREFIX=jobs/ # S3 prefix/folder path
AWS_REGION=us-east-1 # AWS region
AWS_PROFILE= # AWS profile name (optional)
# Logging
LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR

Configure which sites to scrape in `src/config/sites.yaml`:
sites:
- name: "Paysera"
enabled: true
base_url: "https://www.paysera.com"
search_paths:
- "/v2/en/career#positions"
job_listing_selector: "careers?id="
max_pages: 1
- name: "Revolut"
enabled: true
base_url: "https://www.revolut.com"
search_paths:
- "/careers/"
job_listing_selector: "/position/"
max_pages: 1
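For orientation, here is a minimal sketch of how an entry from `sites.yaml` could be loaded into a typed object. It assumes PyYAML plus a hypothetical `SiteConfig` Pydantic model; the real loader lives in `src/config/` and may differ.

```python
# Illustrative sketch only -- the real configuration loader lives in src/config/.
# Assumes PyYAML and Pydantic; SiteConfig is a hypothetical model name.
from pathlib import Path

import yaml
from pydantic import BaseModel


class SiteConfig(BaseModel):
    name: str
    enabled: bool = True
    base_url: str
    search_paths: list[str]
    job_listing_selector: str
    max_pages: int = 1


def load_enabled_sites(path: str = "src/config/sites.yaml") -> list[SiteConfig]:
    """Parse sites.yaml and return only the sites marked as enabled."""
    data = yaml.safe_load(Path(path).read_text())
    sites = [SiteConfig(**entry) for entry in data["sites"]]
    return [site for site in sites if site.enabled]
```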
Run the scraper for all enabled sites:

poetry run python -m src.main

Scrape a specific site:

poetry run python -m src.main --site Revolut
poetry run python -m src.main --site Paysera
poetry run python -m src.main --site DOU  # DOU.UA (multiple companies)

Control the number of jobs processed in parallel (default: 10):
# Process 5 jobs concurrently
poetry run python -m src.main --concurrency 5
poetry run python -m src.main -c 5 # Short form
# Higher concurrency for faster scraping (respects rate limits via semaphore)
poetry run python -m src.main --concurrency 20
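The concurrency limit is enforced with semaphore-based rate limiting (see Features above). A minimal sketch of the pattern, using placeholder names (`process_job`, `job_urls`) rather than the actual implementation:

```python
# Illustrative sketch of semaphore-based concurrency control; the function
# names (process_job, job_urls) are placeholders, not the real API.
import asyncio


async def process_job(url: str) -> None:
    ...  # fetch via Jina Reader, parse, and save the job


async def process_all(job_urls: list[str], concurrency: int = 10) -> None:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> None:
        async with semaphore:  # at most `concurrency` jobs run at once
            await process_job(url)

    await asyncio.gather(*(bounded(url) for url in job_urls))


# asyncio.run(process_all(urls, concurrency=5))
```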
Scrape jobs and automatically deploy to S3:

# Deploy with default concurrency (10)
poetry run python -m src.main --s3-deploy
# Deploy with custom concurrency
poetry run python -m src.main -c 7 --s3-deploy
# Deploy specific site only
poetry run python -m src.main --site Revolut -c 5 --s3-deploy

Options:
-h, --help Show help message
-s, --site SITE        Specific site to scrape (e.g., Revolut, Paysera, DOU)
-o, --output DIR Output directory for job files (default: output)
-c, --concurrency NUM Max concurrent jobs to process (default: 10)
--s3-deploy Deploy scraped files to AWS S3 after scraping
-v, --verbose          Enable verbose logging (DEBUG level)
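For reference, a sketch of how these flags could be declared with `argparse`; it mirrors the documented options but is not the application's actual entry-point code:

```python
# Illustrative argument definitions matching the documented flags;
# the real parser lives in the application's entry point (src/main.py assumed).
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape job listings from career sites")
    parser.add_argument("-s", "--site", help="Specific site to scrape")
    parser.add_argument("-o", "--output", default="output", help="Output directory for job files")
    parser.add_argument("-c", "--concurrency", type=int, default=10,
                        help="Max concurrent jobs to process")
    parser.add_argument("--s3-deploy", action="store_true",
                        help="Deploy scraped files to AWS S3 after scraping")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose (DEBUG) logging")
    return parser
```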
The application generates the following output structure (organized by site domain):

output/
├── www.revolut.com/
│ ├── 2025-10-15_revolut_marketing-manager-crypto_abc123.md
│ ├── 2025-10-15_revolut_senior-backend-engineer_def456.md
│ └── 2025-10-15_revolut_product-designer_ghi789.md
├── www.paysera.com/
│ ├── 2025-10-15_paysera_senior-laravel-developer_jkl012.md
│ ├── 2025-10-15_paysera_frontend-developer_mno345.md
│ └── 2025-10-15_paysera_devops-engineer_pqr678.md
└── jobs.dou.ua/
├── 2025-10-15_dou.ua_senior-python-developer_abc123.md
├── 2025-10-15_dou.ua_lead-data-scientist_def456.md
└── 2025-10-15_dou.ua_senior-java-developer_ghi789.md
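File names follow a `{date}_{source}_{title-slug}_{hash}.md` pattern inside a per-domain folder. A sketch of how such a path can be derived (illustrative only; the real logic and hash scheme live in `src/storage/`):

```python
# Illustrative sketch of the output path convention shown above;
# the actual implementation and hash derivation live in src/storage/.
import hashlib
import re
from datetime import date
from pathlib import Path
from urllib.parse import urlparse


def job_file_path(output_dir: str, job_url: str, source: str, title: str) -> Path:
    domain = urlparse(job_url).netloc                      # e.g. www.revolut.com
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    digest = hashlib.sha256(job_url.encode()).hexdigest()[:6]
    filename = f"{date.today().isoformat()}_{source.lower()}_{slug}_{digest}.md"
    return Path(output_dir) / domain / filename


# Example: job_file_path("output", "https://www.revolut.com/careers/position/123",
#                        "revolut", "Product Designer")
```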
Each job is saved as a Markdown file with YAML frontmatter (limited to 10 keys for AWS Bedrock compatibility):
---
title: Senior Software Engineer
company: TechCorp
location: Remote
source: TechCorp
url: https://careers.techcorp.com/jobs/12345
scraped_at: 2025-10-15T14:30:00.123456
posted_date: 2025-10-14T10:00:00
job_type: Full-time
experience_level: Senior
remote: true
---
# Senior Software Engineer
**Company:** TechCorp
**Location:** Remote
**Salary:** $120,000 - $180,000
**Type:** Full-time
**Experience:** Senior
**Remote:** Yes
## Description
[Job description content...]
## Requirements
- 5+ years of Python development
- Experience with AWS services
- Docker and Kubernetes proficiency
## Raw Content
[Original scraped content...]

Frontmatter Keys (max 10 per AWS Bedrock requirements):
- Required (6 keys): title, company, location, source, url, scraped_at
- Optional (up to 4): posted_date, job_type, experience_level, remote
- Other fields (salary, skills, benefits, etc.) are included in the markdown body only
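A sketch of how the 10-key frontmatter could be assembled before being prepended to the markdown body, assuming PyYAML; the actual serialization lives in `src/storage/`:

```python
# Illustrative frontmatter builder; field selection mirrors the key list above.
# Assumes PyYAML; the real serialization lives in src/storage/.
import yaml

REQUIRED_KEYS = ("title", "company", "location", "source", "url", "scraped_at")
OPTIONAL_KEYS = ("posted_date", "job_type", "experience_level", "remote")


def build_frontmatter(job: dict) -> str:
    """Return a YAML frontmatter block with at most 10 keys (Bedrock limit)."""
    meta = {key: job[key] for key in REQUIRED_KEYS}
    meta.update({key: job[key] for key in OPTIONAL_KEYS if job.get(key) is not None})
    return "---\n" + yaml.safe_dump(meta, sort_keys=False) + "---\n"


# markdown_file = build_frontmatter(job) + markdown_body
```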
The generated markdown files are ready for AWS Bedrock Knowledge Base:
- Frontmatter Compliance: Limited to 10 attributes per AWS Bedrock requirements
- Date Fields: Always includes `scraped_at`, optionally includes `posted_date`
- Metadata Size: Frontmatter kept minimal to stay well under the 2KB limit
- S3 Structure: Output organized by domain for direct S3 upload
- No Separate Metadata Files: All metadata embedded in YAML frontmatter
Talent8 provides native AWS S3 integration for deploying job files to S3 buckets. You can deploy files either:
- Integrated: Automatically after scraping, using the `--s3-deploy` flag
- Standalone: Using the dedicated deployment script
- AWS Credentials: Configure AWS credentials using one of these methods:

# Option 1: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

# Option 2: AWS credentials file (~/.aws/credentials)
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key

# Option 3: AWS profile
[profile talent8]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
- S3 Bucket: Create an S3 bucket or use an existing one:

aws s3 mb s3://your-talent8-bucket --region us-east-1
- IAM Permissions: Ensure your AWS credentials have these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-talent8-bucket",
        "arn:aws:s3:::your-talent8-bucket/*"
      ]
    }
  ]
}

- Environment Configuration: Set the AWS S3 variables in `.env`:

AWS_S3_BUCKET=your-talent8-bucket
AWS_S3_PREFIX=jobs/
AWS_REGION=us-east-1
AWS_PROFILE=          # Optional, for named profiles
Deploy automatically after scraping using the --s3-deploy flag:
# Scrape all sites and deploy to S3 (default concurrency: 10)
poetry run python -m src.main --s3-deploy
# Scrape with custom concurrency and deploy
poetry run python -m src.main -c 7 --s3-deploy
# Scrape specific site and deploy
poetry run python -m src.main --site Revolut -c 5 --s3-deploy

The application will:
- Scrape job listings concurrently from configured sites
- Save files locally to the `output/` directory
- Upload all files to the S3 bucket
- Display deployment summary (uploaded/failed counts)
Use the standalone script for more control and advanced features:
# Basic usage - deploy all files from output/ directory
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket
# Deploy specific site only
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --site www.revolut.com
# Preview deployment without uploading (dry-run)
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --dry-run
# Smart sync - skip unchanged files
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --sync
# Sync with delete - remove S3 files not present locally
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --sync --delete
# Deploy from custom directory
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --source-dir /path/to/jobs
# Deploy with custom S3 prefix
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --prefix custom-path/
# Use specific AWS profile
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --profile talent8
# Enable verbose logging
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --verbose

Arguments:
--bucket, -b BUCKET S3 bucket name (required if not in .env)
--prefix, -p PREFIX S3 prefix/folder path (default: jobs/)
--source-dir, -s DIR Local directory to upload (default: output/)
--site SITE Upload specific site folder only (e.g., www.revolut.com)
--region, -r REGION AWS region (default: us-east-1)
--profile PROFILE AWS profile name
--dry-run Preview uploads without actually uploading
--sync Skip unchanged files (compares file sizes)
--delete Delete S3 files not present locally (use with --sync)
--verbose, -v Enable debug logging
- Smart Upload (see the boto3 sketch after this list):
- Preserves directory structure in S3
- Sets correct Content-Type headers (text/markdown for .md, application/json for .json)
- Recursively uploads all files including subdirectories
- Sync Mode:
- Compares local and S3 files by size
- Skips unchanged files to save time and bandwidth
- Optional delete mode to keep S3 in sync with local files
- Dry-Run Mode:
- Preview what would be uploaded without making changes
- Useful for testing deployment configurations
- Shows full S3 paths for each file
- Error Handling:
- Continues uploading even if some files fail
- Reports detailed success/failure statistics
- Provides clear error messages for common issues
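Under the hood the deployment relies on boto3. A minimal sketch of the upload step with Content-Type mapping and a size-based sync check (illustrative; the script's exact logic may differ):

```python
# Illustrative boto3 upload with Content-Type mapping and a size-based sync check;
# the deployment script's actual logic may differ.
from pathlib import Path

import boto3
from botocore.exceptions import ClientError

CONTENT_TYPES = {".md": "text/markdown", ".json": "application/json"}


def upload_file(s3, bucket: str, local_path: Path, key: str, sync: bool = False) -> bool:
    """Upload one file; with sync=True, skip it when the S3 copy has the same size."""
    if sync:
        try:
            head = s3.head_object(Bucket=bucket, Key=key)
            if head["ContentLength"] == local_path.stat().st_size:
                return False  # unchanged -- skip
        except ClientError:
            pass  # object does not exist yet; upload it
    content_type = CONTENT_TYPES.get(local_path.suffix, "binary/octet-stream")
    s3.upload_file(str(local_path), bucket, key, ExtraArgs={"ContentType": content_type})
    return True


# s3 = boto3.client("s3", region_name="us-east-1")
# upload_file(s3, "your-talent8-bucket", Path("output/www.revolut.com/job1.md"),
#             "jobs/www.revolut.com/job1.md", sync=True)
```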
# Upload all scraped jobs to S3
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs

Output:
2025-10-14 - INFO - Deploying from output to s3://talent8-jobs/jobs/
2025-10-14 - INFO - Found 10 files to upload from output
2025-10-14 - INFO - Uploaded: job1.md -> s3://talent8-jobs/jobs/www.revolut.com/job1.md
...
============================================================
Deployment Summary:
Uploaded: 10 files
Failed: 0 files
============================================================
✓ Deployment completed successfully
# Preview the deployment without uploading (dry-run)
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --dry-run

# Only upload new/modified files
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --sync

# Sync and delete S3 files that no longer exist locally
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --sync --delete

Error: "AWS_S3_BUCKET not configured"
- Solution: Set `AWS_S3_BUCKET` in `.env` or use the `--bucket` argument
Error: "Access denied to S3 bucket"
- Solution: Check IAM permissions and ensure the credentials have `s3:PutObject` permission
Error: "S3 bucket does not exist"
- Solution: Create the bucket with `aws s3 mb s3://your-bucket` or check the bucket name spelling
Error: "NoCredentialsError"
- Solution: Configure AWS credentials using environment variables, credentials file, or AWS profile
- Interfaces (`src/core/interfaces.py`): Abstract interfaces following SOLID principles
- Scrapers (`src/scrapers/`): Web scraping implementations via the Jina.ai Reader API (see the sketch below)
- Parsers (`src/parsers/`): Content parsing and data extraction
- Storage (`src/storage/`): File storage and metadata generation
- Models (`src/models/`): Pydantic data models
- Utils (`src/utils/`): Logging and validation utilities
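To illustrate the Jina.ai integration: the Reader API is called by prefixing the target URL with `https://r.jina.ai/` and, when an API key is configured, sending it as a bearer token. A hedged sketch assuming `httpx`; the actual client in `src/scrapers/jina_reader.py` may differ:

```python
# Illustrative Jina Reader call; the real client in src/scrapers/jina_reader.py
# may use different options, retries, and error handling. Assumes httpx.
import os

import httpx


async def fetch_markdown(url: str) -> str:
    """Fetch a page through Jina's Reader API and return its markdown rendering."""
    headers = {}
    if api_key := os.getenv("JINA_API_KEY"):
        headers["Authorization"] = f"Bearer {api_key}"  # raises the rate limit to 200 req/min
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.get(f"https://r.jina.ai/{url}", headers=headers)
        response.raise_for_status()
        return response.text
```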
See CLAUDE.md for detailed instructions on adding a new site scraper.
Quick overview:
- Add the site configuration to `src/config/sites.yaml`
- Create a parser in `src/parsers/newsite_parser.py` extending `BaseParser`
- Create a scraper in `src/scrapers/site_scrapers/newsite_scraper.py` extending `BaseSiteScraper`
- Register the scraper in `src/scrapers/scraper_factory.py`
- Create integration tests in `tests/test_newsite_integration.py`
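A skeleton of the two new classes; method names and import paths here are illustrative, so follow the actual `BaseParser`/`BaseSiteScraper` interfaces in the codebase:

```python
# src/parsers/newsite_parser.py and src/scrapers/site_scrapers/newsite_scraper.py
# -- illustrative skeletons only; the real base-class method names may differ.
from src.models import Job                              # assumed job model name
from src.parsers.base import BaseParser
from src.scrapers.site_scrapers import BaseSiteScraper  # assumed import path


class NewSiteParser(BaseParser):
    def parse_job(self, markdown: str, url: str) -> Job:
        """Extract title, company, location, etc. from the scraped markdown."""
        ...


class NewSiteScraper(BaseSiteScraper):
    def extract_job_links(self, listing_markdown: str) -> list[str]:
        """Return absolute job URLs found on the listing page."""
        ...
```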
Important: Tests require `JINA_API_KEY` in the `.env` file; they are skipped if it is not set.
# Run all tests
poetry run pytest
# Run specific test file
poetry run pytest tests/test_revolut_integration.py
# Run tests matching pattern
poetry run pytest -k "test_parse_job"
# Run with coverage
poetry run pytest --cov=src
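The API-key requirement is typically expressed as a skip marker along these lines (illustrative; the actual tests may use a fixture instead):

```python
# Illustrative skip guard for integration tests that need a Jina.ai API key.
import os

import pytest

requires_jina = pytest.mark.skipif(
    not os.getenv("JINA_API_KEY"),
    reason="JINA_API_KEY not set in environment/.env",
)


@requires_jina
def test_parse_job_listing():
    ...  # real integration tests call the live Reader API
```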
The project uses automated code quality tools via pre-commit hooks:

# Install pre-commit hooks (one-time setup)
poetry run pre-commit install
# Run all checks manually
poetry run pre-commit run --all-files
# Run specific checks
poetry run ruff check src/ --fix # Lint and auto-fix
poetry run ruff format src/ # Format code
poetry run mypy src/ # Type checking (excludes tests/)
# Checks run automatically on git commit:
# - Ruff (linting + formatting)
# - mypy (type checking)
# - Bandit (security scanning)
# - YAML/file quality checks

The application follows key software design principles:
- YAGNI: Only essential features, no over-engineering
- DRY: Reusable components, no code duplication
- KISS: Simple, straightforward implementations
- SOLID: Interface-based design with dependency injection
- GRASP: Clear responsibilities and information expert pattern
- Strategy Pattern: Site-specific scrapers implement different scraping strategies
- Template Method Pattern: `BaseSiteScraper` defines the common workflow; subclasses implement the specifics
- Factory Pattern: `ScraperFactory` creates the appropriate scraper for each site (see the sketch below)
- Dependency Injection: All dependencies are passed via constructors (SOLID principles)
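A minimal sketch of how such a factory can dispatch on the site name; class names and constructor arguments are assumptions, and the real registry lives in `src/scrapers/scraper_factory.py`:

```python
# Illustrative factory dispatch; the scraper class names and constructor
# arguments are assumptions, not the project's actual API.
from src.scrapers.site_scrapers import DouScraper, PayseraScraper, RevolutScraper


class ScraperFactory:
    _registry = {
        "Revolut": RevolutScraper,
        "Paysera": PayseraScraper,
        "DOU": DouScraper,
    }

    @classmethod
    def create(cls, site_name: str, **dependencies):
        """Return the scraper registered for the given site, injecting its dependencies."""
        try:
            scraper_cls = cls._registry[site_name]
        except KeyError:
            raise ValueError(f"No scraper registered for site: {site_name}") from None
        return scraper_cls(**dependencies)
```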
- Jina.ai Rate Limits:
- Free tier: 20 requests/minute
- With API key: 200 requests/minute
- AWS Bedrock Requirements:
- Frontmatter limited to maximum 10 attributes
- Metadata must be under 2KB (easily satisfied with 10 keys)
- Each markdown document must not exceed 50MB
- Date fields required (scraped_at always included)
- Dynamic Content: Revolut requires JavaScript injection to load all job listings
- Parser Specificity: Title extraction strategies vary by site (URL-based for Revolut, header-based for Paysera)
MIT License - See LICENSE file for details
- Fork the repository
- Create a feature branch
- Make your changes following the coding standards:
- Follow SOLID principles
- Add type hints to all functions
- Use modern Python syntax (PEP 604 union types: `str | None`)
- Write integration tests for new features
- Ensure pre-commit hooks pass
- Test your changes: `poetry run pytest`
- Submit a pull request
- CLAUDE.md: Comprehensive guide for AI assistants and developers
- README.md: This file - user-facing documentation
- .env.example: Environment variable template
For issues or questions:
- Open an issue on GitHub
- Check CLAUDE.md for architecture details
- Review `.env.example` for configuration options
- Jina.ai for the powerful Reader API
- AWS Bedrock for knowledge base capabilities
- The open-source community for inspiration and tools