rdcdc/nyff-scraper

NYFF Scraper

A Python tool for scraping the New York Film Festival (NYFF) website, useful for planning what you want to see at the festival.

Author: Jack Murphy

Features

The script collects every film showing at NYFF, then looks up IMDb data (production companies, runtimes, and a release date where one is available) and compiles the results. It also searches for trailers and records details such as whether a screening is sold out.

The NYFF website has no central place for trailers and does not list release dates; this tool fills in both.

Quick Start

Installation

git clone <repository-url>
cd nyff-scraper
pip install -e .

Basic Usage

# Scrape NYFF 2025 lineup with full enrichment
nyff-scraper

# Scrape a custom URL
nyff-scraper https://www.filmlinc.org/nyff/nyff63-lineup/

# Test with limited films
nyff-scraper --limit 10

# Skip trailer search (faster)
nyff-scraper --skip-trailers

# Export only specific formats
nyff-scraper --csv-only

Output Formats

JSON

The JSON output is structured data that you can use however you like: load it into a spreadsheet, another script, or even an LLM if you want to ask more complex questions. For example, you could ask it to find three films that are unlikely to be in theatres next year and don’t have overlapping showtimes, or filter by whether you want to attend introductions or avoid them.

Browse the JSON or CSV file to see which fields are available.
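As an illustration of the "non-overlapping showtimes" question above, here is a minimal sketch in Python. The field names (`title`, `runtime_minutes`, `showtimes`, `start`) and the sample data are assumptions made for this example, not the scraper's confirmed schema; check your own JSON file and adjust the keys.

```python
import json
from datetime import datetime, timedelta

# Hypothetical sample data; the real export may use different field names.
sample = json.loads("""
[
  {"title": "Film A", "runtime_minutes": 100,
   "showtimes": [{"start": "2025-10-01T18:00:00"}]},
  {"title": "Film B", "runtime_minutes": 90,
   "showtimes": [{"start": "2025-10-01T19:00:00"},
                 {"start": "2025-10-02T14:00:00"}]}
]
""")


def screening_windows(film):
    """Yield a (start, end) datetime pair for each showtime of a film."""
    runtime = timedelta(minutes=film["runtime_minutes"])
    for show in film["showtimes"]:
        start = datetime.fromisoformat(show["start"])
        yield start, start + runtime


def overlaps(a, b):
    """Two windows overlap when each one starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def compatible_pairs(film_a, film_b):
    """Return every pairing of showtimes that lets you see both films."""
    return [
        (a, b)
        for a in screening_windows(film_a)
        for b in screening_windows(film_b)
        if not overlaps(a, b)
    ]


pairs = compatible_pairs(sample[0], sample[1])
for (a_start, _), (b_start, _) in pairs:
    print(f"Film A at {a_start}, Film B at {b_start}")
```

To run this against a real export, replace `sample` with `json.load(open("nyff_films.json"))`, assuming the default output name listed under Command Line Options.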

CSV

Flattened data suitable for spreadsheet analysis, with one row per showtime. That can be a lot to take in at once, but it allows more robust filtering.
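Because each row is a single showtime, filtering is a matter of keeping the rows you care about and regrouping them by film. The sketch below uses Python's standard `csv` module; the column names (`title`, `showtime`, `sold_out`) and the sample rows are assumptions for illustration, so check the header row of your own CSV first.

```python
import csv
import io

# Hypothetical sample with assumed column names; a real export may differ.
sample_csv = io.StringIO(
    "title,showtime,sold_out\n"
    "Film A,2025-10-01 18:00,True\n"
    "Film A,2025-10-03 12:00,False\n"
    "Film B,2025-10-02 14:00,False\n"
)


def available_showtimes(handle):
    """Group showtimes that are not sold out by film title."""
    by_title = {}
    for row in csv.DictReader(handle):
        # CSV values are strings, so compare against the text "true".
        if row["sold_out"].strip().lower() != "true":
            by_title.setdefault(row["title"], []).append(row["showtime"])
    return by_title


available = available_showtimes(sample_csv)
for title, times in available.items():
    print(title, times)
```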

Markdown

Human-readable format perfect for documentation and sharing.

Installation Options

Option 1: pip install (Recommended)

# Clone the repository
gh repo clone rdcdc/nyff-scraper
cd nyff-scraper

# Install in development mode
pip install -e .

# Or install from PyPI (when published)
pip install nyff-scraper

Option 2: Manual setup

# Clone and install dependencies
gh repo clone rdcdc/nyff-scraper
cd nyff-scraper
pip install -r requirements.txt

# Run directly
python -m src.nyff_scraper.cli

Usage Examples

Command Line Interface

The nyff-scraper command provides a comprehensive CLI with many options:

# Full pipeline with all features
nyff-scraper

# Scrape only (no enrichment)
nyff-scraper --only-scrape

# Skip specific enrichment steps
nyff-scraper --skip-imdb --skip-trailers

# Check your Letterboxd account for things it could recommend (experimental)
nyff-scraper --letterboxd yourusername

# Custom output location
nyff-scraper --output-dir ./results --output-name my_films

# Test with limited data
nyff-scraper --limit 5 --verbose

Python API

You can also use the components directly in Python:

from nyff_scraper import NYFFScraper, IMDbEnricher, TrailerEnricher
from nyff_scraper.exporters import export_all_formats

# Initialize components
scraper = NYFFScraper()
imdb_enricher = IMDbEnricher()
trailer_enricher = TrailerEnricher()

# Scrape films
films = scraper.scrape_nyff_lineup()

# Enrich with IMDb data
films = imdb_enricher.enrich_films(films)

# Add trailers
films = trailer_enricher.enrich_films(films, search_trailers=True)

# Export to all formats
export_all_formats(films, "my_films")

Command Line Options

positional arguments:
  url                   URL to scrape (default: NYFF 2025 lineup)

processing options:
  --only-scrape         Only scrape film data, skip IMDb and trailer enrichment
  --skip-imdb          Skip IMDb enrichment (production companies, distributors)
  --skip-trailers      Skip YouTube trailer search
  --limit N            Limit processing to first N films (useful for testing)

output options:
  --output-dir DIR     Output directory for generated files (default: current directory)
  --output-name NAME   Base name for output files (default: nyff_films)
  --cache-dir DIR      Directory for caching web requests (default: cache)

export format options:
  --json-only          Export only JSON format
  --csv-only           Export only CSV format
  --markdown-only      Export only Markdown format

utility options:
  --verbose, -v        Enable verbose logging
  --quiet, -q          Suppress all output except errors
  --help, -h           Show this help message and exit
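To make the interaction between the processing options concrete, here is a small `argparse` sketch. It is not the project's actual `cli.py`; the `plan_steps` helper and the step names are hypothetical, and the sketch assumes that `--only-scrape` behaves like passing both `--skip-imdb` and `--skip-trailers`, as the help text above suggests.

```python
import argparse


def build_parser():
    """Build a parser covering only the processing options listed above."""
    parser = argparse.ArgumentParser(prog="nyff-scraper")
    # The real default URL is not reproduced here, so None stands in for it.
    parser.add_argument("url", nargs="?", default=None, help="URL to scrape")
    parser.add_argument("--only-scrape", action="store_true")
    parser.add_argument("--skip-imdb", action="store_true")
    parser.add_argument("--skip-trailers", action="store_true")
    parser.add_argument("--limit", type=int, metavar="N")
    return parser


def plan_steps(args):
    """Hypothetical helper: list the pipeline steps the flags would enable."""
    steps = ["scrape"]
    if not (args.only_scrape or args.skip_imdb):
        steps.append("imdb")
    if not (args.only_scrape or args.skip_trailers):
        steps.append("trailers")
    steps.append("export")
    return steps


args = build_parser().parse_args(["--skip-trailers", "--limit", "5"])
print(plan_steps(args), args.limit)
```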

Project Structure

nyff-scraper/
├── src/
│   └── nyff_scraper/
│       ├── __init__.py          # Package initialization
│       ├── cli.py               # Command-line interface
│       ├── scraper.py           # Web scraping functionality
│       ├── imdb_enricher.py     # IMDb data enrichment
│       ├── trailer_enricher.py  # YouTube trailer search
│       └── exporters.py         # Data export modules
├── tests/                       # Test suite
├── scripts/                     # Additional utility scripts
├── pyproject.toml              # Project configuration
├── requirements.txt            # Core dependencies
├── requirements-dev.txt        # Development dependencies
├── README.md                   # This file
└── .gitignore                  # Git ignore patterns

Development

Setting up for development

# Clone the repository
git clone <repository-url>
cd nyff-scraper

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Or use requirements files
pip install -r requirements-dev.txt

# Set up pre-commit hooks
pre-commit install

Running tests

# Run all tests
pytest

# Run with coverage
pytest --cov=nyff_scraper

# Run specific test file
pytest tests/test_scraper.py

Code formatting and linting

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

Architecture

The scraper is built with a modular architecture:

  1. NYFFScraper: Handles web scraping of film lineup pages
  2. IMDbEnricher: Searches IMDb and extracts production/distribution data
  3. TrailerEnricher: Searches YouTube for film trailers
  4. Exporters: Convert data to various output formats (JSON, CSV, Markdown)
  5. CLI: Command-line interface tying everything together

Each module can be used independently, making it easy to customize the workflow or extend functionality.
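One way to picture this modularity is as a list of stages, each taking a list of films and returning a new list. The sketch below uses plain functions as stand-ins for the real classes; `run_pipeline`, `drop_untitled`, `tag_long_films`, and the dictionary keys are all hypothetical names chosen for this example, and it assumes films are represented as dictionaries.

```python
from typing import Any, Callable, Dict, List

Film = Dict[str, Any]
Stage = Callable[[List[Film]], List[Film]]


def run_pipeline(films: List[Film], stages: List[Stage]) -> List[Film]:
    """Pass the film list through each stage in order."""
    for stage in stages:
        films = stage(films)
    return films


def drop_untitled(films: List[Film]) -> List[Film]:
    """Stand-in cleanup stage: discard entries without a title."""
    return [film for film in films if film.get("title")]


def tag_long_films(films: List[Film]) -> List[Film]:
    """Stand-in custom stage: flag films of 150 minutes or more."""
    return [
        {**film, "is_long": film.get("runtime_minutes", 0) >= 150}
        for film in films
    ]


raw = [
    {"title": "Film A", "runtime_minutes": 160},
    {"title": "", "runtime_minutes": 80},
    {"title": "Film B", "runtime_minutes": 95},
]
result = run_pipeline(raw, [drop_untitled, tag_long_films])
print(result)
```

A custom stage like `tag_long_films` could sit between the enrichment and export calls shown in the Python API section, provided the real film objects support the same kind of access.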

Extending for Other Festivals

The architecture is designed to be extensible. To adapt for other film festivals:

  1. Create a new scraper class inheriting from a base scraper
  2. Implement festival-specific parsing logic
  3. Update the CLI to support the new festival
  4. Add festival-specific configuration
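Steps 1 and 2 could be sketched as follows. `BaseFestivalScraper`, `ExampleFestScraper`, and the `parse_lineup` / `scrape_html` methods are hypothetical names for illustration; the project may structure its base class differently. The regular expression keeps the example dependency-free, whereas real pages are better handled with beautifulsoup4, which the project already depends on.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseFestivalScraper(ABC):
    """Hypothetical base class holding logic shared by all festivals."""

    festival_name = "unknown"

    @abstractmethod
    def parse_lineup(self, html: str) -> List[Dict[str, str]]:
        """Festival-specific parsing: turn lineup HTML into film dicts."""

    def scrape_html(self, html: str) -> List[Dict[str, str]]:
        """Shared step: parse the page, then tag each film with its festival."""
        films = self.parse_lineup(html)
        for film in films:
            film["festival"] = self.festival_name
        return films


class ExampleFestScraper(BaseFestivalScraper):
    """Made-up festival whose lineup page lists titles in <h3> tags."""

    festival_name = "ExampleFest"

    def parse_lineup(self, html: str) -> List[Dict[str, str]]:
        titles = re.findall(r"<h3[^>]*>(.*?)</h3>", html, flags=re.S)
        return [{"title": title.strip()} for title in titles]


page = "<h3>First Film</h3><p>Notes</p><h3 class='t'> Second Film </h3>"
films = ExampleFestScraper().scrape_html(page)
print(films)
```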

Dependencies

  • requests: HTTP library for web scraping
  • beautifulsoup4: HTML parsing and extraction
  • lxml: Fast XML/HTML parser

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite and linting
  6. Commit your changes (git commit -am 'Add new feature')
  7. Push to the branch (git push origin feature/new-feature)
  8. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • New York Film Festival for providing accessible film data
  • IMDb for production and distribution information
  • YouTube for trailer hosting and search capabilities

Troubleshooting

Common Issues

"No films found": Check that the URL is correct and the website structure hasn't changed.

Rate limiting: The scraper includes delays to be respectful to servers. For faster testing, use the --limit option to process fewer films.

Missing dependencies: Ensure all requirements are installed with pip install -r requirements.txt.

Permission errors: Make sure you have write permissions in the output directory.
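For context on the rate-limiting note above, a delay between requests can be implemented with a small throttle like the one sketched here. This is an illustration of the general technique, not the scraper's actual code; the `Throttle` class and its parameters are hypothetical.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested quickly
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() before each request; the first call returns immediately.
throttle = Throttle(min_interval=0.01)
for _ in range(3):
    throttle.wait()
    # a request to the festival site or IMDb would go here
```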

Getting Help

  • Check the Issues page for known problems
  • Create a new issue with detailed error information
  • Use the --verbose flag for detailed logging when reporting issues

About

Python CLI tool that scrapes the NYFF (New York Film Festival) schedule, adds trailers, and exports JSON/CSV/MD.
