Talent8 - Job Scraper with Jina.ai Reader API

A scalable Python application for scraping job data from career sites using Jina.ai's Reader API. The application saves job listings in Markdown format with AWS Bedrock-compatible metadata, ready for S3 deployment.

Features

  • Site-Specific Scrapers: Currently supports Paysera, Revolut, and DOU.UA (Ukrainian job board with 324 companies)
  • Jina.ai Integration: Leverages Jina.ai's powerful Reader API for reliable content extraction
  • Dynamic Content Support: Handles JavaScript-rendered pages (e.g., Revolut's "Show more" functionality)
  • AWS Bedrock Ready: Generates markdown files with Bedrock-compatible frontmatter (max 10 attributes)
  • Markdown Output: Clean, readable Markdown files for each job listing with YAML frontmatter
  • Concurrent Processing: Configurable parallel job processing (default: 10 concurrent jobs)
  • Async/Await Architecture: High-performance async processing with semaphore-based rate limiting
  • S3 Deployment: Native AWS S3 integration with sync and dry-run modes
  • Extensible Design: SOLID principles and clean architecture for easy extension
  • Code Quality: Automated linting with Ruff, type checking with mypy, security scanning with Bandit
  • Pre-commit Hooks: Automatic code quality checks before each commit

Architecture

The application follows SOLID principles with a clean, modular architecture:

talent8/
├── src/
│   ├── config/                  # Configuration management
│   ├── core/                    # Core interfaces and exceptions
│   ├── scrapers/
│   │   ├── site_scrapers/       # Site-specific scrapers (Revolut, Paysera)
│   │   ├── jina_reader.py       # Jina.ai API integration
│   │   └── scraper_factory.py   # Factory for creating scrapers
│   ├── parsers/                 # Job data extraction
│   │   ├── base.py              # Base parser with common methods
│   │   ├── revolut_parser.py    # Revolut-specific parsing
│   │   └── paysera_parser.py    # Paysera-specific parsing
│   ├── models/                  # Data models (Pydantic)
│   ├── storage/                 # File storage and metadata
│   └── utils/                   # Utilities (logging, validation)
├── tests/                       # Integration tests
├── .pre-commit-config.yaml      # Pre-commit hook configuration
└── pyproject.toml               # Poetry config + tool settings

Installation

Prerequisites

  • Python 3.11 or higher
  • Poetry (for dependency management)

Setup

  1. Clone the repository:

     git clone https://github.com/yourusername/talent8.git
     cd talent8

  2. Install dependencies with Poetry:

     poetry install

  3. Copy the environment configuration:

     cp .env.example .env

  4. Get a Jina.ai API key (required for tests, optional for basic usage).

  5. (Optional) Set up pre-commit hooks:

     poetry run pre-commit install

Configuration

Environment Variables (.env)

# Jina.ai Configuration
JINA_API_KEY=your_api_key_here  # Optional, for higher rate limits
JINA_RATE_LIMIT=20               # 20 for free tier, 200 with API key

# Storage Configuration
OUTPUT_DIR=output                # Where to save job files
INCLUDE_METADATA=false          # Deprecated: metadata now in frontmatter only

# AWS S3 Configuration (for deployment)
AWS_S3_BUCKET=                  # S3 bucket name (required for --s3-deploy)
AWS_S3_PREFIX=jobs/              # S3 prefix/folder path
AWS_REGION=us-east-1             # AWS region
AWS_PROFILE=                     # AWS profile name (optional)

# Logging
LOG_LEVEL=INFO                  # DEBUG, INFO, WARNING, ERROR
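
For illustration, a minimal sketch of how the Reader API is typically called with these settings: the target URL is prefixed with https://r.jina.ai/ and the API key, when set, is sent as a Bearer token. The function name and the httpx dependency are assumptions for this sketch, not the project's actual jina_reader.py implementation.

# Sketch only: fetch a page as Markdown via the Jina Reader endpoint.
# Assumes the public https://r.jina.ai/ prefix and the httpx library.
import os
import httpx

async def fetch_markdown(url: str) -> str:
    headers = {}
    api_key = os.getenv("JINA_API_KEY")
    if api_key:  # optional; raises the rate limit from 20 to 200 requests/minute
        headers["Authorization"] = f"Bearer {api_key}"
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.get(f"https://r.jina.ai/{url}", headers=headers)
        response.raise_for_status()
        return response.text  # page content rendered as Markdown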

Site Configuration (src/config/sites.yaml)

Configure which sites to scrape:

sites:
  - name: "Paysera"
    enabled: true
    base_url: "https://www.paysera.com"
    search_paths:
      - "/v2/en/career#positions"
    job_listing_selector: "careers?id="
    max_pages: 1

  - name: "Revolut"
    enabled: true
    base_url: "https://www.revolut.com"
    search_paths:
      - "/careers/"
    job_listing_selector: "/position/"
    max_pages: 1
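
For illustration, a sketch of how this YAML could be loaded into typed models. The class and field names mirror the keys above but are assumptions; the project's actual config classes in src/config/ may differ.

# Sketch only: load sites.yaml into Pydantic models (field names mirror the
# YAML keys above; the real config classes may be structured differently).
import yaml
from pydantic import BaseModel

class SiteConfig(BaseModel):
    name: str
    enabled: bool = True
    base_url: str
    search_paths: list[str]
    job_listing_selector: str
    max_pages: int = 1

def load_sites(path: str = "src/config/sites.yaml") -> list[SiteConfig]:
    with open(path) as f:
        return [SiteConfig(**site) for site in yaml.safe_load(f)["sites"]]

enabled_sites = [site for site in load_sites() if site.enabled]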

Usage

Basic Usage

Run the scraper for all enabled sites:

poetry run python -m src.main

Scrape Specific Site

poetry run python -m src.main --site Revolut
poetry run python -m src.main --site Paysera
poetry run python -m src.main --site DOU          # DOU.UA (multiple companies)

Concurrent Processing

Control the number of jobs processed in parallel (default: 10):

# Process 5 jobs concurrently
poetry run python -m src.main --concurrency 5
poetry run python -m src.main -c 5                    # Short form

# Higher concurrency for faster scraping (respects rate limits via semaphore)
poetry run python -m src.main --concurrency 20
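
As a rough sketch of what this flag controls: the concurrency value typically becomes the size of an asyncio.Semaphore that bounds how many jobs are scraped at once. The names below are illustrative, not the actual code in src/main.py.

# Sketch only: semaphore-based concurrency limiting, as used for --concurrency.
import asyncio

async def scrape_job(url: str) -> None:
    await asyncio.sleep(0)  # placeholder: fetch via Jina Reader, parse, save

async def process_all(job_urls: list[str], concurrency: int = 10) -> None:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> None:
        async with semaphore:  # at most `concurrency` jobs in flight at once
            await scrape_job(url)

    await asyncio.gather(*(bounded(url) for url in job_urls))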

Scrape and Deploy to S3

Scrape jobs and automatically deploy to S3:

# Deploy with default concurrency (10)
poetry run python -m src.main --s3-deploy

# Deploy with custom concurrency
poetry run python -m src.main -c 7 --s3-deploy

# Deploy specific site only
poetry run python -m src.main --site Revolut -c 5 --s3-deploy

Command Line Options

Options:
  -h, --help                    Show help message
  -s, --site SITE               Specific site to scrape (Revolut, Paysera, or DOU)
  -o, --output DIR              Output directory for job files (default: output)
  -c, --concurrency NUM         Max concurrent jobs to process (default: 10)
  --s3-deploy                   Deploy scraped files to AWS S3 after scraping
  -v, --verbose                 Enable verbose logging (DEBUG level)

Output Structure

The application generates the following output structure (organized by site domain):

output/
├── www.revolut.com/
│   ├── 2025-10-15_revolut_marketing-manager-crypto_abc123.md
│   ├── 2025-10-15_revolut_senior-backend-engineer_def456.md
│   └── 2025-10-15_revolut_product-designer_ghi789.md
├── www.paysera.com/
│   ├── 2025-10-15_paysera_senior-laravel-developer_jkl012.md
│   ├── 2025-10-15_paysera_frontend-developer_mno345.md
│   └── 2025-10-15_paysera_devops-engineer_pqr678.md
└── jobs.dou.ua/
    ├── 2025-10-15_dou.ua_senior-python-developer_abc123.md
    ├── 2025-10-15_dou.ua_lead-data-scientist_def456.md
    └── 2025-10-15_dou.ua_senior-java-developer_ghi789.md
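
The file names follow a YYYY-MM-DD_source_slug_hash.md pattern. Below is a sketch of how such a name could be derived; the helper is hypothetical and the real slugging and hashing scheme may differ.

# Sketch only: build a file name matching the pattern shown above.
import hashlib
import re
from datetime import date

def job_filename(source: str, title: str, url: str) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    digest = hashlib.sha256(url.encode()).hexdigest()[:6]
    return f"{date.today():%Y-%m-%d}_{source}_{slug}_{digest}.md"

# e.g. job_filename("revolut", "Marketing Manager - Crypto", "<job url>")
# -> "2025-10-15_revolut_marketing-manager-crypto_ab12cd.md" (date and hash vary)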

Markdown Format

Each job is saved as a Markdown file with YAML frontmatter (limited to 10 keys for AWS Bedrock compatibility):

---
title: Senior Software Engineer
company: TechCorp
location: Remote
source: TechCorp
url: https://careers.techcorp.com/jobs/12345
scraped_at: 2025-10-15T14:30:00.123456
posted_date: 2025-10-14T10:00:00
job_type: Full-time
experience_level: Senior
remote: true
---

# Senior Software Engineer

**Company:** TechCorp
**Location:** Remote
**Salary:** $120,000 - $180,000
**Type:** Full-time
**Experience:** Senior
**Remote:** Yes

## Description
[Job description content...]

## Requirements
- 5+ years of Python development
- Experience with AWS services
- Docker and Kubernetes proficiency

## Raw Content
[Original scraped content...]

Frontmatter Keys (max 10 per AWS Bedrock requirements):

  • Required (6 keys): title, company, location, source, url, scraped_at
  • Optional (up to 4): posted_date, job_type, experience_level, remote
  • Other fields (salary, skills, benefits, etc.) are included in the markdown body only
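
For illustration, a sketch of how such frontmatter could be assembled, with required keys first and optional keys included only when present. The function and field names are illustrative, not the project's exact storage API.

# Sketch only: build Bedrock-compatible frontmatter (max 10 keys) from a job record.
from typing import Any

REQUIRED_KEYS = ("title", "company", "location", "source", "url", "scraped_at")
OPTIONAL_KEYS = ("posted_date", "job_type", "experience_level", "remote")

def build_frontmatter(job: dict[str, Any]) -> str:
    keys = list(REQUIRED_KEYS) + [k for k in OPTIONAL_KEYS if job.get(k) is not None]
    assert len(keys) <= 10, "AWS Bedrock allows at most 10 metadata attributes"
    lines = [f"{key}: {job[key]}" for key in keys]
    return "---\n" + "\n".join(lines) + "\n---\n"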

AWS Bedrock Integration

The generated markdown files are ready for AWS Bedrock Knowledge Base:

  1. Frontmatter Compliance: Limited to 10 attributes per AWS Bedrock requirements
  2. Date Fields: Always includes scraped_at, optionally includes posted_date
  3. Metadata Size: Frontmatter kept minimal to stay well under 2KB limit
  4. S3 Structure: Output organized by domain for direct S3 upload
  5. No Separate Metadata Files: All metadata embedded in YAML frontmatter

S3 Deployment

Talent8 provides native AWS S3 integration for deploying job files to S3 buckets. You can deploy files either:

  1. Integrated: Automatically after scraping using the --s3-deploy flag
  2. Standalone: Using the dedicated deployment script

Prerequisites

  1. AWS Credentials: Configure AWS credentials using one of these methods:

    # Option 1: Environment variables
    export AWS_ACCESS_KEY_ID=your_access_key
    export AWS_SECRET_ACCESS_KEY=your_secret_key
    
    # Option 2: AWS credentials file (~/.aws/credentials)
    [default]
    aws_access_key_id = your_access_key
    aws_secret_access_key = your_secret_key
    
    # Option 3: Named profile (in ~/.aws/config)
    [profile talent8]
    aws_access_key_id = your_access_key
    aws_secret_access_key = your_secret_key
  2. S3 Bucket: Create an S3 bucket or use an existing one

    aws s3 mb s3://your-talent8-bucket --region us-east-1
  3. IAM Permissions: Ensure your AWS credentials have these permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::your-talent8-bucket",
            "arn:aws:s3:::your-talent8-bucket/*"
          ]
        }
      ]
    }
  4. Environment Configuration: Set AWS S3 variables in .env:

    AWS_S3_BUCKET=your-talent8-bucket
    AWS_S3_PREFIX=jobs/
    AWS_REGION=us-east-1
    AWS_PROFILE=                    # Optional, for named profiles

Method 1: Integrated Deployment (Recommended)

Deploy automatically after scraping using the --s3-deploy flag:

# Scrape all sites and deploy to S3 (default concurrency: 10)
poetry run python -m src.main --s3-deploy

# Scrape with custom concurrency and deploy
poetry run python -m src.main -c 7 --s3-deploy

# Scrape specific site and deploy
poetry run python -m src.main --site Revolut -c 5 --s3-deploy

The application will:

  1. Scrape job listings concurrently from configured sites
  2. Save files locally to output/ directory
  3. Upload all files to S3 bucket
  4. Display deployment summary (uploaded/failed counts)

Method 2: Standalone Deployment Script

Use the standalone script for more control and advanced features:

# Basic usage - deploy all files from output/ directory
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket

# Deploy specific site only
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --site www.revolut.com

# Preview deployment without uploading (dry-run)
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --dry-run

# Smart sync - skip unchanged files
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --sync

# Sync with delete - remove S3 files not present locally
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --sync --delete

# Deploy from custom directory
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --source-dir /path/to/jobs

# Deploy with custom S3 prefix
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --prefix custom-path/

# Use specific AWS profile
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --profile talent8

# Enable verbose logging
poetry run python scripts/deploy_to_s3.py --bucket your-talent8-bucket --verbose

Deployment Script Options

Arguments:
  --bucket, -b BUCKET    S3 bucket name (required if not in .env)
  --prefix, -p PREFIX    S3 prefix/folder path (default: jobs/)
  --source-dir, -s DIR   Local directory to upload (default: output/)
  --site SITE           Upload specific site folder only (e.g., www.revolut.com)
  --region, -r REGION   AWS region (default: us-east-1)
  --profile PROFILE     AWS profile name
  --dry-run             Preview uploads without actually uploading
  --sync                Skip unchanged files (compares file sizes)
  --delete              Delete S3 files not present locally (use with --sync)
  --verbose, -v         Enable debug logging

S3 Deployment Features

  1. Smart Upload:

    • Preserves directory structure in S3
    • Sets correct Content-Type headers (text/markdown for .md, application/json for .json)
    • Recursively uploads all files including subdirectories
  2. Sync Mode:

    • Compares local and S3 files by size
    • Skips unchanged files to save time and bandwidth
    • Optional delete mode to keep S3 in sync with local files
  3. Dry-Run Mode:

    • Preview what would be uploaded without making changes
    • Useful for testing deployment configurations
    • Shows full S3 paths for each file
  4. Error Handling:

    • Continues uploading even if some files fail
    • Reports detailed success/failure statistics
    • Provides clear error messages for common issues
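
Below is a sketch of the upload behaviour described above (content-type mapping plus size-based skipping in sync mode), using boto3. It is illustrative only, not the actual scripts/deploy_to_s3.py implementation.

# Sketch only: upload one file with the correct Content-Type, skipping it in
# sync mode when the S3 object already has the same size.
from pathlib import Path
import boto3
from botocore.exceptions import ClientError

CONTENT_TYPES = {".md": "text/markdown", ".json": "application/json"}

def upload_file(s3, bucket: str, local: Path, key: str, sync: bool = False) -> bool:
    if sync:
        try:
            head = s3.head_object(Bucket=bucket, Key=key)
            if head["ContentLength"] == local.stat().st_size:
                return False  # unchanged, skip
        except ClientError:
            pass  # object not in S3 yet, upload it
    content_type = CONTENT_TYPES.get(local.suffix, "application/octet-stream")
    s3.upload_file(str(local), bucket, key, ExtraArgs={"ContentType": content_type})
    return True  # uploaded

# s3 = boto3.client("s3")
# upload_file(s3, "your-talent8-bucket",
#             Path("output/www.revolut.com/job.md"),
#             "jobs/www.revolut.com/job.md", sync=True)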

Deployment Examples

Example 1: First-time deployment

# Upload all scraped jobs to S3
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs

Output:

2025-10-14 - INFO - Deploying from output to s3://talent8-jobs/jobs/
2025-10-14 - INFO - Found 10 files to upload from output
2025-10-14 - INFO - Uploaded: job1.md -> s3://talent8-jobs/jobs/www.revolut.com/job1.md
...
============================================================
Deployment Summary:
  Uploaded: 10 files
  Failed:   0 files
============================================================
✓ Deployment completed successfully

Example 2: Preview deployment (dry-run)

poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --dry-run

Example 3: Update existing S3 deployment

# Only upload new/modified files
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --sync

Example 4: Clean sync (remove deleted jobs from S3)

# Sync and delete S3 files that no longer exist locally
poetry run python scripts/deploy_to_s3.py --bucket talent8-jobs --sync --delete

Troubleshooting S3 Deployment

Error: "AWS_S3_BUCKET not configured"

  • Solution: Set AWS_S3_BUCKET in .env or use --bucket argument

Error: "Access denied to S3 bucket"

  • Solution: Check IAM permissions, ensure credentials have s3:PutObject permission

Error: "S3 bucket does not exist"

  • Solution: Create bucket with aws s3 mb s3://your-bucket or check bucket name spelling

Error: "NoCredentialsError"

  • Solution: Configure AWS credentials using environment variables, credentials file, or AWS profile

Development

Project Structure

  • Interfaces (src/core/interfaces.py): Abstract interfaces following SOLID principles
  • Scrapers (src/scrapers/): Web scraping implementations
  • Parsers (src/parsers/): Content parsing and data extraction
  • Storage (src/storage/): File storage and metadata generation
  • Models (src/models/): Pydantic data models
  • Utils (src/utils/): Logging, validation utilities

Adding a New Site

See CLAUDE.md for detailed instructions on adding a new site scraper.

Quick overview (a skeleton sketch follows these steps):

  1. Add site configuration to src/config/sites.yaml
  2. Create parser in src/parsers/newsite_parser.py extending BaseParser
  3. Create scraper in src/scrapers/site_scrapers/newsite_scraper.py extending BaseSiteScraper
  4. Register scraper in src/scrapers/scraper_factory.py
  5. Create integration tests in tests/test_newsite_integration.py
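
A skeleton of what steps 2-4 could look like. The BaseParser import path is taken from the project tree above; the method names and the scraper base class location are assumptions, so mirror the existing Revolut/Paysera implementations and CLAUDE.md for the real signatures.

# Sketch only -- src/parsers/newsite_parser.py (step 2)
from src.parsers.base import BaseParser  # path taken from the project tree

class NewSiteParser(BaseParser):
    def parse(self, markdown: str, url: str) -> dict:  # signature is illustrative
        ...  # extract title, company, location, description from the Markdown

# Step 3: create NewSiteScraper(BaseSiteScraper) in
# src/scrapers/site_scrapers/newsite_scraper.py, then register it in
# src/scrapers/scraper_factory.py (step 4).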

Running Tests

Important: Tests require JINA_API_KEY in .env file. Tests will skip if not set.

# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_revolut_integration.py

# Run tests matching pattern
poetry run pytest -k "test_parse_job"

# Run with coverage
poetry run pytest --cov=src

Code Quality

The project uses automated code quality tools via pre-commit hooks:

# Install pre-commit hooks (one-time setup)
poetry run pre-commit install

# Run all checks manually
poetry run pre-commit run --all-files

# Run specific checks
poetry run ruff check src/ --fix        # Lint and auto-fix
poetry run ruff format src/             # Format code
poetry run mypy src/                    # Type checking (excludes tests/)

# Checks run automatically on git commit:
# - Ruff (linting + formatting)
# - mypy (type checking)
# - Bandit (security scanning)
# - YAML/file quality checks

Design Principles

The application follows key software design principles:

  • YAGNI: Only essential features, no over-engineering
  • DRY: Reusable components, no code duplication
  • KISS: Simple, straightforward implementations
  • SOLID: Interface-based design with dependency injection
  • GRASP: Clear responsibilities and information expert pattern

Key Design Patterns

  • Strategy Pattern: Site-specific scrapers implement different scraping strategies
  • Template Method Pattern: BaseSiteScraper defines common workflow, subclasses implement specifics
  • Factory Pattern: ScraperFactory creates appropriate scraper for each site
  • Dependency Injection: All dependencies passed via constructors (SOLID principles)
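
As an illustration of how the factory and constructor injection fit together, here is a small sketch; the class and method names are illustrative, not the project's exact API.

# Sketch only: a factory that picks the site-specific strategy and injects its
# dependencies through the constructor.
class ExampleScraperFactory:
    def __init__(self, registry: dict[str, type]):
        self._registry = registry  # maps site name -> scraper class (strategy)

    def create(self, site_name: str, reader, parser):
        scraper_cls = self._registry[site_name]
        return scraper_cls(reader=reader, parser=parser)  # dependency injection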

Limitations & Considerations

  • Jina.ai Rate Limits:
    • Free tier: 20 requests/minute
    • With API key: 200 requests/minute
  • AWS Bedrock Requirements:
    • Frontmatter limited to maximum 10 attributes
    • Metadata must be under 2KB (easily satisfied with 10 keys)
    • Each markdown document must not exceed 50MB
    • Date fields required (scraped_at always included)
  • Dynamic Content: Revolut requires JavaScript injection to load all job listings
  • Parser Specificity: Title extraction strategies vary by site (URL-based for Revolut, header-based for Paysera)

License

MIT License - See LICENSE file for details

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes following the coding standards:
    • Follow SOLID principles
    • Add type hints to all functions
    • Use modern Python syntax (PEP 604 union types: str | None)
    • Write integration tests for new features
    • Ensure pre-commit hooks pass
  4. Test your changes: poetry run pytest
  5. Submit a pull request

Project Documentation

  • CLAUDE.md: Comprehensive guide for AI assistants and developers
  • README.md: This file - user-facing documentation
  • .env.example: Environment variable template

Support

For issues or questions:

  • Open an issue on GitHub
  • Check CLAUDE.md for architecture details
  • Review .env.example for configuration options

Acknowledgments

  • Jina.ai for the powerful Reader API
  • AWS Bedrock for knowledge base capabilities
  • The open-source community for inspiration and tools
