NYFF Scraper

A Python tool for scraping the New York Film Festival (NYFF) website, useful for planning what you want to see at the festival.

Author: Jack Murphy

Features

The script collects every film showing at NYFF, looks up IMDb data for each one (production companies, runtime, and a release date when available), and compiles the results. It also finds trailers and records whether each screening is sold out.

The NYFF site doesn't gather trailers in one place or list release dates; this tool does both for you.

Quick Start

Installation

git clone <repository-url>
cd nyff-scraper
pip install -e .

Basic Usage

# Scrape NYFF 2025 lineup with full enrichment
nyff-scraper

# Scrape a custom URL
nyff-scraper https://www.filmlinc.org/nyff/nyff63-lineup/

# Test with limited films
nyff-scraper --limit 10

# Skip trailer search (faster)
nyff-scraper --skip-trailers

# Export only specific formats
nyff-scraper --csv-only

Output Formats

JSON

The JSON output is structured data that you can use however you like: load it into a spreadsheet, another script, or an LLM if you want to ask more complex questions. For example, you could ask for three films that are unlikely to be in theatres next year and don't have overlapping showtimes, or filter by whether you want to attend introductions or avoid them.

Browse the JSON or CSV file to see which fields are available.
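
As a minimal sketch of the "another script" route, the example below filters films down to those with at least one showtime that is not sold out. The field names (title, showtimes, start, sold_out) and the sample records are assumptions made for illustration; check your own JSON file for the real field names before adapting it.

```python
import json

# Hypothetical sample in place of the real output file.
# The actual field names may differ; inspect your JSON export first.
SAMPLE = """
[
  {"title": "Film A",
   "showtimes": [{"start": "2025-10-01T18:00", "sold_out": true},
                 {"start": "2025-10-03T21:00", "sold_out": false}]},
  {"title": "Film B",
   "showtimes": [{"start": "2025-10-02T15:00", "sold_out": true}]}
]
"""


def films_with_available_showtimes(films):
    """Return (title, [start times]) for films that still have open showtimes."""
    result = []
    for film in films:
        open_slots = [s for s in film.get("showtimes", []) if not s.get("sold_out")]
        if open_slots:
            result.append((film["title"], [s["start"] for s in open_slots]))
    return result


films = json.loads(SAMPLE)
available = films_with_available_showtimes(films)
for title, starts in available:
    print(title, starts)
```

To run this against real output, replace json.loads(SAMPLE) with json.load on the opened export file and rename the keys to match.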

CSV

Flattened data suitable for spreadsheet analysis, with one row per showtime. That can be a lot to take in at once, but it lets you filter more precisely.
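
Because each row is a single showtime, a film with several screenings appears on several rows. The sketch below shows one way to collapse those rows back into a per-film view using only the standard library. The column names (title, showtime, sold_out) and the sample rows are assumptions; the real header row may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in place of the real CSV export.
SAMPLE_CSV = """title,showtime,sold_out
Film A,2025-10-01T18:00,True
Film A,2025-10-03T21:00,False
Film B,2025-10-02T15:00,True
"""


def showtimes_by_film(csv_text):
    """Group one-row-per-showtime CSV text into {title: [showtimes]}."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["title"]].append(row["showtime"])
    return dict(grouped)


grouped = showtimes_by_film(SAMPLE_CSV)
for title, showtimes in grouped.items():
    print(f"{title}: {len(showtimes)} showtime(s)")
```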

Markdown

Human-readable format perfect for documentation and sharing.

Installation Options

Option 1: pip install (Recommended)

# Clone the repository
gh repo clone rdcdc/nyff-scraper
cd nyff-scraper

# Install in development mode
pip install -e .

# Or install from PyPI (when published)
pip install nyff-scraper

Option 2: Manual setup

# Clone and install dependencies
gh repo clone rdcdc/nyff-scraper
cd nyff-scraper
pip install -r requirements.txt

# Run directly
python -m src.nyff_scraper.cli

Usage Examples

Command Line Interface

The nyff-scraper command provides a comprehensive CLI with many options:

# Full pipeline with all features
nyff-scraper

# Scrape only (no enrichment)
nyff-scraper --only-scrape

# Skip specific enrichment steps
nyff-scraper --skip-imdb --skip-trailers

# Check your Letterboxd account for films it could recommend (experimental)
nyff-scraper --letterboxd yourusername

# Custom output location
nyff-scraper --output-dir ./results --output-name my_films

# Test with limited data
nyff-scraper --limit 5 --verbose

Python API

You can also use the components directly in Python:

from nyff_scraper import NYFFScraper, IMDbEnricher, TrailerEnricher
from nyff_scraper.exporters import export_all_formats

# Initialize components
scraper = NYFFScraper()
imdb_enricher = IMDbEnricher()
trailer_enricher = TrailerEnricher()

# Scrape films
films = scraper.scrape_nyff_lineup()

# Enrich with IMDb data
films = imdb_enricher.enrich_films(films)

# Add trailers
films = trailer_enricher.enrich_films(films, search_trailers=True)

# Export to all formats
export_all_formats(films, "my_films")

Command Line Options

positional arguments:
  url                   URL to scrape (default: NYFF 2025 lineup)

processing options:
  --only-scrape         Only scrape film data, skip IMDb and trailer enrichment
  --skip-imdb           Skip IMDb enrichment (production companies, distributors)
  --skip-trailers       Skip YouTube trailer search
  --limit N             Limit processing to first N films (useful for testing)

output options:
  --output-dir DIR      Output directory for generated files (default: current directory)
  --output-name NAME    Base name for output files (default: nyff_films)
  --cache-dir DIR       Directory for caching web requests (default: cache)

export format options:
  --json-only           Export only JSON format
  --csv-only            Export only CSV format
  --markdown-only       Export only Markdown format

utility options:
  --verbose, -v         Enable verbose logging
  --quiet, -q           Suppress all output except errors
  --help, -h            Show this help message and exit
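
To illustrate how flags like these can fit together, the sketch below rebuilds a subset of them with argparse. It is not the project's cli.py: build_parser and enrichment_steps are hypothetical names, and the rule that --only-scrape overrides the two --skip flags is inferred from the help text above rather than taken from the source.

```python
import argparse


def build_parser():
    """Build a parser mirroring a subset of the documented options (illustrative)."""
    parser = argparse.ArgumentParser(prog="nyff-scraper")
    # The real default URL is the NYFF 2025 lineup; None stands in for it here.
    parser.add_argument("url", nargs="?", default=None, help="URL to scrape")

    processing = parser.add_argument_group("processing options")
    processing.add_argument("--only-scrape", action="store_true")
    processing.add_argument("--skip-imdb", action="store_true")
    processing.add_argument("--skip-trailers", action="store_true")
    processing.add_argument("--limit", type=int, metavar="N")

    output = parser.add_argument_group("output options")
    output.add_argument("--output-dir", default=".", metavar="DIR")
    output.add_argument("--output-name", default="nyff_films", metavar="NAME")
    return parser


def enrichment_steps(args):
    """Return the enrichment steps implied by the parsed flags (assumed logic)."""
    if args.only_scrape:
        return []
    steps = []
    if not args.skip_imdb:
        steps.append("imdb")
    if not args.skip_trailers:
        steps.append("trailers")
    return steps


args = build_parser().parse_args(["--limit", "5", "--skip-trailers"])
print(args.limit, enrichment_steps(args))
```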

Project Structure

nyff-scraper/
├── src/
│   └── nyff_scraper/
│       ├── __init__.py          # Package initialization
│       ├── cli.py               # Command-line interface
│       ├── scraper.py           # Web scraping functionality
│       ├── imdb_enricher.py     # IMDb data enrichment
│       ├── trailer_enricher.py  # YouTube trailer search
│       └── exporters.py         # Data export modules
├── tests/                       # Test suite
├── scripts/                     # Additional utility scripts
├── pyproject.toml               # Project configuration
├── requirements.txt             # Core dependencies
├── requirements-dev.txt         # Development dependencies
├── README.md                    # This file
└── .gitignore                   # Git ignore patterns

Development

Setting up for development

# Clone the repository
git clone <repository-url>
cd nyff-scraper

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Or use requirements files
pip install -r requirements-dev.txt

# Set up pre-commit hooks
pre-commit install

Running tests

# Run all tests
pytest

# Run with coverage
pytest --cov=nyff_scraper

# Run specific test file
pytest tests/test_scraper.py

Code formatting and linting

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

Architecture

The scraper is built with a modular architecture:

NYFFScraper: Handles web scraping of film lineup pages
IMDbEnricher: Searches IMDb and extracts production/distribution data
TrailerEnricher: Searches YouTube for film trailers
Exporters: Convert data to various output formats (JSON, CSV, Markdown)
CLI: Command-line interface tying everything together

Each module can be used independently, making it easy to customize the workflow or extend functionality.
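
One way to picture this modularity is as a list of stages, each taking and returning a list of film records. The sketch below uses stand-in stages rather than the real IMDbEnricher and TrailerEnricher so that it runs without network access. run_pipeline and both fake stages are hypothetical names, as are the production_companies and trailer_url keys.

```python
def run_pipeline(films, stages):
    """Pass the film list through each stage in order and return the result."""
    for stage in stages:
        films = stage(films)
    return films


def fake_imdb_stage(films):
    # Stand-in for IMDb enrichment: adds an empty placeholder field.
    return [{**film, "production_companies": []} for film in films]


def fake_trailer_stage(films):
    # Stand-in for trailer search: adds an empty placeholder field.
    return [{**film, "trailer_url": None} for film in films]


scraped = [{"title": "Film A"}]

# Full pipeline versus scrape-only: the difference is just which stages run.
full = run_pipeline(scraped, [fake_imdb_stage, fake_trailer_stage])
scrape_only = run_pipeline(scraped, [])
print(full)
```

Dropping a stage from the list corresponds to skipping that enrichment step, and a new stage can be added without touching the others.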

Extending for Other Festivals

The architecture is designed to be extensible. To adapt for other film festivals:

Create a new scraper class inheriting from a base scraper
Implement festival-specific parsing logic
Update the CLI to support the new festival
Add festival-specific configuration
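
The first two steps above might look roughly like the sketch below. BaseFestivalScraper and ExampleFestivalScraper are hypothetical names: the package may not ship a base class yet, and the assumption that film titles sit in h3 tags is made up for the example. Parsing uses the standard library's html.parser so the sketch has no third-party dependencies.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class BaseFestivalScraper(ABC):
    """Hypothetical base class: shared setup, festival-specific parsing."""

    def __init__(self, lineup_url):
        self.lineup_url = lineup_url

    @abstractmethod
    def parse_lineup(self, html):
        """Turn a lineup page's HTML into a list of film dicts."""


class _TagTextCollector(HTMLParser):
    """Collect the stripped text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.texts.append(data.strip())


class ExampleFestivalScraper(BaseFestivalScraper):
    """Festival-specific logic: assumes each film title is in an <h3>."""

    def parse_lineup(self, html):
        collector = _TagTextCollector("h3")
        collector.feed(html)
        collector.close()
        return [{"title": text} for text in collector.texts]


scraper = ExampleFestivalScraper("https://example.org/lineup")
print(scraper.parse_lineup("<h3>Film A</h3><p>Notes</p><h3>Film B</h3>"))
```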

Dependencies

requests: HTTP library for web scraping
beautifulsoup4: HTML parsing and extraction
lxml: Fast XML/HTML parser

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/new-feature)
Make your changes
Add tests for new functionality
Run the test suite and linting
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature/new-feature)
Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

New York Film Festival for providing accessible film data
IMDb for production and distribution information
YouTube for trailer hosting and search capabilities

Troubleshooting

Common Issues

"No films found": Check that the URL is correct and the website structure hasn't changed.

Rate limiting: The scraper includes delays to be respectful to servers. For faster testing, use the --limit option.

Missing dependencies: Ensure all requirements are installed with pip install -r requirements.txt.

Permission errors: Make sure you have write permissions in the output directory.
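
For readers curious about the rate-limiting delays mentioned above, the sketch below shows one common way to space out requests. It is not the scraper's actual implementation: the Throttle class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last call: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() before each request.
throttle = Throttle(min_interval=0.01)
for page in ["page-1", "page-2", "page-3"]:
    throttle.wait()
    print("would fetch", page)
```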

Getting Help

Check the Issues page for known problems
Create a new issue with detailed error information
Use the --verbose flag for detailed logging when reporting issues
