ropensci/gutenbergr

gutenbergr

Search, download, and process public domain texts from the Project Gutenberg collection.

Installation

Install the released version from CRAN:

install.packages("gutenbergr")

Install the development version from GitHub:

# install.packages("pak")
pak::pak("ropensci/gutenbergr")

Quick Start

Load the package:

library(gutenbergr)
library(dplyr)

We’ll get and set our Project Gutenberg mirror:

gutenberg_get_mirror()
#> [1] "https://aleph.pglaf.org"

Search through the metadata to find a book:

gutenberg_works(title == "Persuasion")
#> # A tibble: 1 × 8
#>   gutenberg_id title      author       gutenberg_author_id language
#>          <int> <chr>      <chr>                      <int> <fct>   
#> 1          105 Persuasion Austen, Jane                  68 en      
#>   gutenberg_bookshelf                           rights                    has_text
#>   <chr>                                         <fct>                     <lgl>   
#> 1 Category: Novels/Category: British Literature Public domain in the USA. TRUE
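Because gutenberg_works() takes filter conditions on the metadata columns, the same call pattern can look up works by author rather than by title. A minimal sketch, assuming the bundled metadata lists the author exactly as "Austen, Jane" (the form shown in the output above):

```r
library(gutenbergr)
library(dplyr)

# Filter the bundled metadata by author instead of title.
# The comparison is an exact string match, so the name must use the
# "Surname, Forename" form that appears in the author column.
austen <- gutenberg_works(author == "Austen, Jane") |>
  select(gutenberg_id, title)

austen
```

The result is an ordinary tibble, so its gutenberg_id column can be passed straight to gutenberg_download().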

Persuasion’s gutenberg_id is 105, which we’ll use to download the text. Setting the cache option to "persistent" means we won’t have to re-download it later.

options(gutenbergr_cache_type = "persistent")
persuasion <- gutenberg_download(105)
persuasion
#> # A tibble: 8,357 × 2
#>    gutenberg_id text            
#>           <int> <chr>           
#>  1          105 "Persuasion"    
#>  2          105 ""              
#>  3          105 ""              
#>  4          105 "by Jane Austen"
#>  5          105 ""              
#>  6          105 "(1818)"        
#>  7          105 ""              
#>  8          105 ""              
#>  9          105 ""              
#> 10          105 ""              
#> # ℹ 8,347 more rows
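Since the download is a tibble with one row per line of text, ordinary dplyr verbs are enough for a first pass at processing. The sketch below drops blank lines and adds a running chapter number. It uses a small invented tibble in place of a real download so that it runs offline, and the "Chapter N" heading pattern is an assumption that will not fit every book.

```r
library(dplyr)

# Toy stand-in for a downloaded book: one row per line of text.
# These lines are invented placeholders, not the real text of any work.
book_toy <- tibble(
  gutenberg_id = 105L,
  text = c(
    "Title", "", "Chapter 1", "First line.",
    "", "Chapter 2", "Second line."
  )
)

chapters <- book_toy |>
  # Drop blank lines.
  filter(text != "") |>
  # Each heading match increments the running chapter count;
  # lines before the first heading get chapter 0.
  mutate(chapter = cumsum(grepl("^Chapter [0-9]+", text)))

chapters
```

For a real book, the regular expression would need adjusting to that text's heading style (for example, Roman numerals or upper-case "CHAPTER").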

Multiple works can be downloaded at once. Here we also attach each work’s title from the metadata with the meta_fields argument.

books <- gutenberg_download(c(105, 109), meta_fields = "title")
books |> count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Persuasion                   8357
#> 2 Renascence, and Other Poems  1222
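With a title column attached, per-work summaries become a grouped operation. The following sketch counts total and non-empty lines per title. It uses a small invented tibble with the same column layout as the download above (gutenberg_id, text, title), so the counts are illustrative and do not match the real books.

```r
library(dplyr)

# Toy stand-in for gutenberg_download(..., meta_fields = "title").
# The text lines are invented placeholders.
books_toy <- tibble(
  gutenberg_id = c(105L, 105L, 105L, 109L, 109L),
  text = c("Persuasion", "", "by Jane Austen", "Renascence", ""),
  title = c(
    rep("Persuasion", 3),
    rep("Renascence, and Other Poems", 2)
  )
)

line_summary <- books_toy |>
  group_by(title) |>
  summarise(
    lines = n(),                # all rows for the work
    non_empty = sum(text != "") # rows that contain text
  )

line_summary
```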

Vignettes

See the package vignettes for more advanced usage of gutenbergr.

FAQ

How were the metadata files generated?

See the data-raw directory for scripts. Metadata was generated from the Project Gutenberg catalog on 11 January 2026.

Do you respect robot access rules?

Yes! The package follows Project Gutenberg’s rules:

  • Retrieves books directly from mirrors using the authorized link format
  • Prioritizes .zip files to minimize bandwidth
  • Supports session and persistent caching
  • Is designed for downloading individual works or small collections, not the entire corpus. For bulk downloads, set up a mirror.
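One way to keep a script within the "small collections" guideline is to check the size of a request before downloading anything. The wrapper below is a hypothetical helper, not part of gutenbergr, and its limit of 20 works is an arbitrary value chosen for illustration rather than a threshold stated by Project Gutenberg.

```r
# Hypothetical helper (not part of gutenbergr): refuse oversized requests
# before any download starts. The default limit of 20 is an arbitrary
# illustration, not an official threshold.
download_small_batch <- function(ids, max_ids = 20) {
  ids <- unique(ids)
  if (length(ids) > max_ids) {
    stop(
      "Requested ", length(ids), " works; the limit is ", max_ids,
      ". For bulk downloads, set up a mirror.",
      call. = FALSE
    )
  }
  # Reuse downloaded files across sessions, as in the Quick Start.
  options(gutenbergr_cache_type = "persistent")
  gutenbergr::gutenberg_download(ids)
}

# An oversized request fails before gutenberg_download() is ever called.
result <- tryCatch(
  download_small_batch(1:100),
  error = function(e) conditionMessage(e)
)
result
```

A request within the limit, such as download_small_batch(105), falls through to gutenberg_download() and does need network access.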

See Project Gutenberg’s Terms of Use for details.

Contributing

See CONTRIBUTING.md.

Note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
