Talent8 - Job Scraper with Jina.ai Reader API

A scalable Python application for scraping job data from career sites using Jina.ai's Reader API. The application saves job listings in Markdown format with AWS Bedrock-compatible metadata, ready for S3 deployment.

Features

  • Site-Specific Scrapers: Currently supports Paysera, Revolut, and DOU.UA (Ukrainian job board with 324 companies)
  • Jina.ai Integration: Leverages Jina.ai's powerful Reader API for reliable content extraction
  • Dynamic Content Support: Handles JavaScript-rendered pages (e.g., Revolut's "Show more" functionality)
  • AWS Bedrock Ready: Generates markdown files with Bedrock-compatible frontmatter (max 10 attributes)
  • Markdown Output: Clean, readable Markdown files for each job listing with YAML frontmatter
  • Concurrent Processing: Configurable parallel job processing (default: 10 concurrent jobs)
  • Async/Await Architecture: High-performance async processing with semaphore-based rate limiting
  • S3 Deployment: Native AWS S3 integration with sync and dry-run modes
  • Extensible Design: SOLID principles and clean architecture for easy extension
  • Code Quality: Automated linting with Ruff, type checking with mypy, security scanning with Bandit
  • Pre-commit Hooks: Automatic code quality checks before each commit

Architecture

The application follows SOLID principles with a clean, modular architecture:

talent8/
├── src/
│   ├── config/                  # Configuration management
│   ├── core/                    # Core interfaces and exceptions
│   ├── scrapers/
│   │   ├── site_scrapers/       # Site-specific scrapers (Revolut, Paysera)
│   │   ├── jina_reader.py       # Jina.ai API integration
│   │   └── scraper_factory.py   # Factory for creating scrapers
│   ├── parsers/                 # Job data extraction
│   │   ├── base.py              # Base parser with common methods
│   │   ├── revolut_parser.py    # Revolut-specific parsing
│   │   └── paysera_parser.py    # Paysera-specific parsing
│   ├── models/                  # Data models (Pydantic)
│   ├── storage/                 # File storage and metadata
│   └── utils/                   # Utilities (logging, validation)
├── tests/                       # Integration tests
├── .pre-commit-config.yaml      # Pre-commit hook configuration
└── pyproject.toml               # Poetry config + tool settings

Installation

Prerequisites

  • Python 3.11 or higher
  • Poetry (for dependency management)

Setup

  1. Clone the repository:

     git clone https://github.com/yourusername/talent8.git
     cd talent8

  2. Install dependencies with Poetry:

     poetry install

  3. Copy the environment configuration:

     cp .env.example .env

  4. Get a Jina.ai API key (required for tests, optional for basic usage).

  5. (Optional) Set up pre-commit hooks:

     poetry run pre-commit install

Configuration

Environment Variables (.env)

# Jina.ai Configuration
JINA_API_KEY=your_api_key_here  # Optional, for higher rate limits
JINA_RATE_LIMIT=20               # 20 for free tier, 200 with API key

# Storage Configuration
OUTPUT_DIR=output                # Where to save job files
INCLUDE_METADATA=false          # Deprecated: metadata now in frontmatter only

# AWS S3 Configuration (for deployment)
AWS_S3_BUCKET=                  # S3 bucket name (required for --s3-deploy)
AWS_S3_PREFIX=jobs/              # S3 prefix/folder path
AWS_REGION=us-east-1             # AWS region
AWS_PROFILE=                     # AWS profile name (optional)

# Logging
LOG_LEVEL=INFO                  # DEBUG, INFO, WARNING, ERROR
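
For illustration, a minimal sketch of how the Reader API is typically called with these settings: the target URL is prefixed with https://r.jina.ai/ and the API key, when set, is sent as a Bearer token. The function name and the httpx dependency are assumptions for this sketch, not the project's actual jina_reader.py implementation.

# Sketch only: fetch a page as Markdown via the Jina Reader endpoint.
# Assumes the public https://r.jina.ai/ prefix and the httpx library.
import os
import httpx

async def fetch_markdown(url: str) -> str:
    headers = {}
    api_key = os.getenv("JINA_API_KEY")
    if api_key:  # optional; raises the rate limit from 20 to 200 requests/minute
        headers["Authorization"] = f"Bearer {api_key}"
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.get(f"https://r.jina.ai/{url}", headers=headers)
        response.raise_for_status()
        return response.text  # page content rendered as Markdown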

Site Configuration (src/config/sites.yaml)

Configure which sites to scrape:

sites:
  - name: "Paysera"
    enabled: true
    base_url: "https://www.paysera.com"
    search_paths:
      - "/v2/en/career#positions"
    job_listing_selector: "careers?id="
    max_pages: 1

  - name: "Revolut"
    enabled: true
    base_url: "https://www.revolut.com"
    search_paths:
      - "/careers/"
    job_listing_selector: "/position/"
    max_pages: 1
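
For illustration, a sketch of how this YAML could be loaded into typed models. The class and field names mirror the keys above but are assumptions; the project's actual config classes in src/config/ may differ.

# Sketch only: load sites.yaml into Pydantic models (field names mirror the
# YAML keys above; the real config classes may be structured differently).
import yaml
from pydantic import BaseModel

class SiteConfig(BaseModel):
    name: str
    enabled: bool = True
    base_url: str
    search_paths: list[str]
    job_listing_selector: str
    max_pages: int = 1

def load_sites(path: str = "src/config/sites.yaml") -> list[SiteConfig]:
    with open(path) as f:
        return [SiteConfig(**site) for site in yaml.safe_load(f)["sites"]]

enabled_sites = [site for site in load_sites() if site.enabled]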

Usage

Basic Usage

Run the scraper for all enabled sites:

poetry run python -m src.main

Scrape Specific Site

poetry run python -m src.main --site Revolut
poetry run python -m src.main --site Paysera
poetry run python -m src.main --site DOU          # DOU.UA (multiple companies)

Concurrent Processing

Control the number of jobs processed in parallel (default: 10):

# Process 5 jobs concurrently
poetry run python -m src.main --concurrency 5
poetry run python -m src.main -c 5                    # Short form

# Higher concurrency for faster scraping (respects rate limits via semaphore)
poetry run python -m src.main --concurrency 20
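
As a rough sketch of what this flag controls: the concurrency value typically becomes the size of an asyncio.Semaphore that bounds how many jobs are scraped at once. The names below are illustrative, not the actual code in src/main.py.

# Sketch only: semaphore-based concurrency limiting, as used for --concurrency.
import asyncio

async def scrape_job(url: str) -> None:
    await asyncio.sleep(0)  # placeholder: fetch via Jina Reader, parse, save

async def process_all(job_urls: list[str], concurrency: int = 10) -> None:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> None:
        async with semaphore:  # at most `concurrency` jobs in flight at once
            await scrape_job(url)

    await asyncio.gather(*(bounded(url) for url in job_urls))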

Scrape and Deploy to S3

Scrape jobs and automatically deploy to S3:

# Deploy with default concurrency (10)
poetry run python -m src.main --s3-deploy

# Deploy with custom concurrency
poetry run python -m src.main -c 7 --s3-deploy

# Deploy specific site only
poetry run python -m src.main --site Revolut -c 5 --s3-deploy

Command Line Options

Options:
  -h, --help                    Show help message
  -s, --site SITE               Specific site to scrape (Revolut, Paysera, or DOU)
  -o, --output DIR              Output directory for job files (default: output)
  -c, --concurrency NUM         Max concurrent jobs to process (default: 10)
  --s3-deploy                   Deploy scraped files to AWS S3 after scraping
  -v, --verbose                 Enable verbose logging (DEBUG level)

Output Structure

The application generates the following output structure (organized by site domain):

output/
├── www.revolut.com/
│   ├── 2025-10-15_revolut_marketing-manager-crypto_abc123.md
│   ├── 2025-10-15_revolut_senior-backend-engineer_def456.md
│   └── 2025-10-15_revolut_product-designer_ghi789.md
├── www.paysera.com/
│   ├── 2025-10-15_paysera_senior-laravel-developer_jkl012.md
│   ├── 2025-10-15_paysera_frontend-developer_mno345.md
│   └── 2025-10-15_paysera_devops-engineer_pqr678.md
└── jobs.dou.ua/
    ├── 2025-10-15_dou.ua_senior-python-developer_abc123.md
    ├── 2025-10-15_dou.ua_lead-data-scientist_def456.md
    └── 2025-10-15_dou.ua_senior-java-developer_ghi789.md
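
The file names follow a YYYY-MM-DD_source_slug_hash.md pattern. Below is a sketch of how such a name could be derived; the helper is hypothetical and the real slugging and hashing scheme may differ.

# Sketch only: build a file name matching the pattern shown above.
import hashlib
import re
from datetime import date

def job_filename(source: str, title: str, url: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    digest = hashlib.sha256(url.encode()).hexdigest()[:6]
    return f"{date.today():%Y-%m-%d}_{source}_{slug}_{digest}.md"

# e.g. job_filename("revolut", "Marketing Manager - Crypto", "<job url>")
# -> "2025-10-15_revolut_marketing-manager-crypto_ab12cd.md" (date and hash vary)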

Markdown Format

Each job is saved as a Markdown file with YAML frontmatter (limited to 10 keys for AWS Bedrock compatibility):

---
title: Senior Software Engineer
company: TechCorp
location: Remote
source: TechCorp
url: https://careers.techcorp.com/jobs/12345
scraped_at: 2025-10-15T14:30:00.123456
posted_date: 2025-10-14T10:00:00
job_type: Full-time
experience_level: Senior
remote: true
---

# Senior Software Engineer

**Company:** TechCorp
**Location:** Remote
**Salary:** $120,000 - $180,000
**Type:** Full-time
**Experience:** Senior
**Remote:** Yes

## Description
[Job description content...]

## Requirements
- 5+ years of Python development
- Experience with AWS services
- Docker and Kubernetes proficiency

## Raw Content
[Original scraped content...]

Frontmatter Keys (max 10 per AWS Bedrock requirements):

  • Required (6 keys): title, company, location, source, url, scraped_at
  • Optional (up to 4): posted_date, job_type, experience_level, remote
  • Other fields (salary, skills, benefits, etc.) are included in the markdown body only
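
For illustration, a sketch of how such frontmatter could be assembled, with required keys first and optional keys included only when present. The function and field names are illustrative, not the project's exact storage API.

# Sketch only: build Bedrock-compatible frontmatter (max 10 keys) from a job record.
from typing import Any

REQUIRED_KEYS = ("title", "company", "location", "source", "url", "scraped_at")
OPTIONAL_KEYS = ("posted_date", "job_type", "experience_level", "remote")

def build_frontmatter(job: dict[str, Any]) -> str:
    keys = list(REQUIRED_KEYS) + [k for k in OPTIONAL_KEYS if job.get(k) is not None]
    assert len(keys) <= 10, "AWS Bedrock allows at most 10 metadata attributes"
    lines = [f"{key}: {job[key]}" for key in keys]
    return "---\n" + "\n".join(lines) + "\n---\n"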

AWS Bedrock Integration

The generated markdown files are ready for AWS Bedrock Knowledge Base:

  1. Frontmatter Compliance: Limited to 10 attributes per AWS Bedrock requirements
  2. Date Fields: Always includes scraped_at, optionally includes posted_date
  3. Metadata Size: Frontmatter kept minimal to stay well under 2KB limit
  4. S3 Structure: Output organized by domain for direct S3 upload
  5. No Separate Metadata Files: All metadata embedded in YAML frontmatter

S3 Deployment

Talent8 provides native AWS S3 integration for deploying job files to S3 buckets. You can deploy files either:

  1. Integrated: Automatically after scraping using the --s3-deploy flag
  2. Standalone: Using the dedicated deployment script

Prerequisites

  1. AWS Credentials: Configure AWS credentials using one of these methods:

    # Option 1: Environment variables
    export AWS_ACCESS_KEY_ID=your_access_key
    export AWS_SECRET_ACCESS_KEY=your_secret_key
    
    # Option 2: AWS credentials file (~/.aws/credentials)
    [default]
    aws_access_key_id = your_access_key
    aws_secret_access_key = your_secret_key
    
    # Option 3: Named profile (in ~/.aws/config)
    [profile talent8]
    aws_access_key_id = your_access_key
    aws_secret_access_key = your_secret_key
  2. S3 Bucket: Create an S3 bucket or use an existing one

    aws s3 mb s3://your-talent8-bucket --region us-east-1
  3. IAM Permissions: Ensure your AWS credentials have these permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::your-talent8-bucket",
            "arn:aws:s3:::your-talent8-bucket/*"
          ]
        }
      ]
    }
  4. Environment Configuration: Set AWS S3 variables in .env:

    AWS_S3_BUCKET=your-talent8-bucket
    AWS_S3_PREFIX=jobs/
    AWS_REGION=us-east-1
    AWS_PROFILE=                    # Optional, for named profiles

Method 1: Integrated Deployment (Recommended)

Deploy automatically after scraping using the --s3-deploy flag:

# Scrape all sites and deploy to S3 (default concurrency: 10)
poetry run python -m src.main --s3-deploy

# Scrape with custom concurrency and deploy
poetry run python -m src.main -c 7 --s3-deploy

# Scrape specific site and deploy
poetry run python -m src.main --site Revolut -c 5 --s3-deploy

The application will:

  1. Scrape job listings concurrently from configured sites
  2. Save files locally to output/ directory
  3. Upload all files to S3 bucket
  4. Display deployment summary (uploaded/failed counts)

Method 2: Standalone Deployment Script

Use the standalone script for more control and advanced features:

# Basic usage - deploy all files from output/ directory
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket

# Deploy specific site only
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --site www.revolut.com

# Preview deployment without uploading (dry-run)
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --dry-run

# Smart sync - skip unchanged files
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --sync

# Sync with delete - remove S3 files not present locally
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --sync --delete

# Deploy from custom directory
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --source-dir /path/to/jobs

# Deploy with custom S3 prefix
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --prefix custom-path/

# Use specific AWS profile
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --profile talent8

# Enable verbose logging
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --verbose

Deployment Script Options

Arguments:
  --bucket, -b BUCKET    S3 bucket name (required if not in .env)
  --prefix, -p PREFIX    S3 prefix/folder path (default: jobs/)
  --source-dir, -s DIR   Local directory to upload (default: output/)
  --site SITE           Upload specific site folder only (e.g., www.revolut.com)
  --region, -r REGION   AWS region (default: us-east-1)
  --profile PROFILE     AWS profile name
  --dry-run             Preview uploads without actually uploading
  --sync                Skip unchanged files (compares file sizes)
  --delete              Delete S3 files not present locally (use with --sync)
  --verbose, -v         Enable debug logging

S3 Deployment Features

  1. Smart Upload:

    • Preserves directory structure in S3
    • Sets correct Content-Type headers (text/markdown for .md, application/json for .json)
    • Recursively uploads all files including subdirectories
  2. Sync Mode:

    • Compares local and S3 files by size
    • Skips unchanged files to save time and bandwidth
    • Optional delete mode to keep S3 in sync with local files
  3. Dry-Run Mode:

    • Preview what would be uploaded without making changes
    • Useful for testing deployment configurations
    • Shows full S3 paths for each file
  4. Error Handling:

    • Continues uploading even if some files fail
    • Reports detailed success/failure statistics
    • Provides clear error messages for common issues
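
Below is a sketch of the upload behaviour described above (content-type mapping plus size-based skipping in sync mode), using boto3. It is illustrative only, not the actual scripts/deploy_to_s3.py implementation.

# Sketch only: upload one file with the correct Content-Type, skipping it in
# sync mode when the S3 object already has the same size.
from pathlib import Path
import boto3
from botocore.exceptions import ClientError

CONTENT_TYPES = {".md": "text/markdown", ".json": "application/json"}

def upload_file(s3, bucket: str, local: Path, key: str, sync: bool = False) -> bool:
    if sync:
        try:
            head = s3.head_object(Bucket=bucket, Key=key)
            if head["ContentLength"] == local.stat().st_size:
                return False  # unchanged, skip
        except ClientError:
            pass  # object not in S3 yet, upload it
    content_type = CONTENT_TYPES.get(local.suffix, "application/octet-stream")
    s3.upload_file(str(local), bucket, key, ExtraArgs={"ContentType": content_type})
    return True  # uploaded

# s3 = boto3.client("s3")
# upload_file(s3, "your-talent8-bucket",
#             Path("output/www.revolut.com/job.md"),
#             "jobs/www.revolut.com/job.md", sync=True)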

Deployment Examples

Example 1: First-time deployment

# Upload all scraped jobs to S3
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs

Output:

2025-10-14 - INFO - Deploying from output to s3://talent8-jobs/jobs/
2025-10-14 - INFO - Found 10 files to upload from output
2025-10-14 - INFO - Uploaded: job1.md -> s3://talent8-jobs/jobs/www.revolut.com/job1.md
...
============================================================
Deployment Summary:
  Uploaded: 10 files
  Failed:   0 files
============================================================
✓ Deployment completed successfully

Example 2: Preview deployment (dry-run)

poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --dry-run

Example 3: Update existing S3 deployment

# Only upload new/modified files
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --sync

Example 4: Clean sync (remove deleted jobs from S3)

# Sync and delete S3 files that no longer exist locally
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --sync --delete

Troubleshooting S3 Deployment

Error: "AWS_S3_BUCKET not configured"

  • Solution: Set AWS_S3_BUCKET in .env or use --bucket argument

Error: "Access denied to S3 bucket"

  • Solution: Check IAM permissions, ensure credentials have s3:PutObject permission

Error: "S3 bucket does not exist"

  • Solution: Create bucket with aws s3 mb s3://your-bucket or check bucket name spelling

Error: "NoCredentialsError"

  • Solution: Configure AWS credentials using environment variables, credentials file, or AWS profile

Development

Project Structure

  • Interfaces (src/core/interfaces.py): Abstract interfaces following SOLID principles
  • Scrapers (src/scrapers/): Web scraping implementations
  • Parsers (src/parsers/): Content parsing and data extraction
  • Storage (src/storage/): File storage and metadata generation
  • Models (src/models/): Pydantic data models
  • Utils (src/utils/): Logging, validation utilities

Adding a New Site

See CLAUDE.md for detailed instructions on adding a new site scraper.

Quick overview (a skeleton sketch follows these steps):

  1. Add site configuration to src/config/sites.yaml
  2. Create parser in src/parsers/newsite_parser.py extending BaseParser
  3. Create scraper in src/scrapers/site_scrapers/newsite_scraper.py extending BaseSiteScraper
  4. Register scraper in src/scrapers/scraper_factory.py
  5. Create integration tests in tests/test_newsite_integration.py
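
A skeleton of what steps 2-4 could look like. The BaseParser import path is taken from the project tree above; the method names and the scraper base class location are assumptions, so mirror the existing Revolut/Paysera implementations and CLAUDE.md for the real signatures.

# Sketch only -- src/parsers/newsite_parser.py (step 2)
from src.parsers.base import BaseParser  # path taken from the project tree

class NewSiteParser(BaseParser):
    def parse(self, markdown: str, url: str) -> dict:  # signature is illustrative
        ...  # extract title, company, location, description from the Markdown

# Step 3: create NewSiteScraper(BaseSiteScraper) in
# src/scrapers/site_scrapers/newsite_scraper.py, then register it in
# src/scrapers/scraper_factory.py (step 4).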

Running Tests

Important: Tests require JINA_API_KEY in .env file. Tests will skip if not set.

# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_revolut_integration.py

# Run tests matching pattern
poetry run pytest -k "test_parse_job"

# Run with coverage
poetry run pytest --cov=src

Code Quality

The project uses automated code quality tools via pre-commit hooks:

# Install pre-commit hooks (one-time setup)
poetry run pre-commit install

# Run all checks manually
poetry run pre-commit run --all-files

# Run specific checks
poetry run ruff check src/ --fix        # Lint and auto-fix
poetry run ruff format src/             # Format code
poetry run mypy src/                    # Type checking (excludes tests/)

# Checks run automatically on git commit:
# - Ruff (linting + formatting)
# - mypy (type checking)
# - Bandit (security scanning)
# - YAML/file quality checks

Design Principles

The application follows key software design principles:

  • YAGNI: Only essential features, no over-engineering
  • DRY: Reusable components, no code duplication
  • KISS: Simple, straightforward implementations
  • SOLID: Interface-based design with dependency injection
  • GRASP: Clear responsibilities and information expert pattern

Key Design Patterns

  • Strategy Pattern: Site-specific scrapers implement different scraping strategies
  • Template Method Pattern: BaseSiteScraper defines common workflow, subclasses implement specifics
  • Factory Pattern: ScraperFactory creates appropriate scraper for each site
  • Dependency Injection: All dependencies passed via constructors (SOLID principles)
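
As an illustration of how the factory and constructor injection fit together, here is a small sketch; the class and method names are illustrative, not the project's exact API.

# Sketch only: a factory that picks the site-specific strategy and injects its
# dependencies through the constructor.
class ExampleScraperFactory:
    def __init__(self, registry: dict[str, type]):
        self._registry = registry  # maps site name -> scraper class (strategy)

    def create(self, site_name: str, reader, parser):
        scraper_cls = self._registry[site_name]
        return scraper_cls(reader=reader, parser=parser)  # dependency injection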

Limitations & Considerations

  • Jina.ai Rate Limits:
    • Free tier: 20 requests/minute
    • With API key: 200 requests/minute
  • AWS Bedrock Requirements:
    • Frontmatter limited to maximum 10 attributes
    • Metadata must be under 2KB (easily satisfied with 10 keys)
    • Each markdown document must not exceed 50MB
    • Date fields required (scraped_at always included)
  • Dynamic Content: Revolut requires JavaScript injection to load all job listings
  • Parser Specificity: Title extraction strategies vary by site (URL-based for Revolut, header-based for Paysera)

License

MIT License - See LICENSE file for details

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes following the coding standards:
    • Follow SOLID principles
    • Add type hints to all functions
    • Use modern Python syntax (PEP 604 union types: str | None)
    • Write integration tests for new features
    • Ensure pre-commit hooks pass
  4. Test your changes: poetry run pytest
  5. Submit a pull request

Project Documentation

  • CLAUDE.md: Comprehensive guide for AI assistants and developers
  • README.md: This file - user-facing documentation
  • .env.example: Environment variable template

Support

For issues or questions:

  • Open an issue on GitHub
  • Check CLAUDE.md for architecture details
  • Review .env.example for configuration options

Acknowledgments

  • Jina.ai for the powerful Reader API
  • AWS Bedrock for knowledge base capabilities
  • The open-source community for inspiration and tools
