faLearn

Introduction

faLearn provides utilities to transform whole-genome sequence data into feature representations suitable for machine learning, with a focus on $k$-mer based features and XGBoost-compatible input (libsvm). The package is intended for workflows that start from raw or assembled genome FASTA/FNA files and end with model-ready data.

This repository is an offshot of the MIC package. For MIC analysis functions, please refer to that package.

Main features

Convert individual genomes (FASTA/FNA) into $k$-mer counts.
Export $k$-mer counts in XGBoost-friendly libsvm format (one .txt per genome).
Fast $k$-mer counting implemented in C++ (Rcpp) for performance.
Helpers to process directories of genomes in parallel (via future.apply) and to split/combine libsvm files for training/validation workflows.
Memory-efficient XGBoost cross-validation using xgb.cv.lowmem.

Installation

Install from GitHub (development version):

# install.packages("remotes")
remotes::install_github("agerada/faLearn")

Quick examples

libsvm is the preferred input format for XGBoost. Data are represented as a sparce matrix in a special text format.

To convert a directory of genomes (.fna or .fasta) directly to libsvm files (one line per genome):

library(faLearn)
data("example_genomes", package = "faLearn")

# use the first genome from the packaged example_genomes
genome_ds <- example_genomes$genomes[[1]]
tmp_out <- file.path(tempdir(), "kmers")
unlink(tmp_out, recursive = TRUE)
dir.create(tmp_out, recursive = TRUE, showWarnings = FALSE)
target_file <- file.path(normalizePath(tmp_out), paste0(names(example_genomes$genomes)[1], ".txt"))
future::plan(future::sequential) # or multisession, etc.
progressr::with_progress({
  genome_to_libsvm(as.character(genome_ds),
                   target_file,
                   k = 5)
})
#> Loading required namespace: Biostrings
#> [1] TRUE

Count $k$-mers for a single sequence string using the fast C++ implementation:

kmers("ATCGATCGA", k = 3)
#> $kmer_string
#>  [1] "AAA" "AAC" "AAG" "AAT" "ACA" "ACC" "ACG" "ACT" "AGA" "AGC" "AGG" "AGT"
#> [13] "ATA" "ATC" "ATG" "ATT" "CAA" "CAC" "CAG" "CAT" "CCA" "CCC" "CCG" "CCT"
#> [25] "CGA" "CGC" "CGG" "CGT" "CTA" "CTC" "CTG" "CTT" "GAA" "GAC" "GAG" "GAT"
#> [37] "GCA" "GCC" "GCG" "GCT" "GGA" "GGC" "GGG" "GGT" "GTA" "GTC" "GTG" "GTT"
#> [49] "TAA" "TAC" "TAG" "TAT" "TCA" "TCC" "TCG" "TCT" "TGA" "TGC" "TGG" "TGT"
#> [61] "TTA" "TTC" "TTG" "TTT"
#> 
#> $kmer_value
#>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Train a small XGBoost model (cross-validated) using $k$-mer features and xgb.cv.lowmem. This example simulates a numeric phenotype (e.g., an MIC value in this case) and demonstrates how to build a feature matrix from multiple genome files.

data("example_genomes", package = "faLearn")

genomes_list <- example_genomes$genomes

# create kmer libsvm features saved to disk
tmp_out <- file.path(tempdir(), "kmers_all")
unlink(tmp_out, recursive = TRUE)
dir.create(tmp_out, recursive = TRUE, showWarnings = FALSE)
future::plan(future::sequential) # or multisession, etc.
progressr::handlers("txtprogressbar")
progressr::with_progress({
  # iterate over genomes by index so we can use the genome id (not contig names)
  for (i in seq_along(genomes_list)) {
    g <- genomes_list[[i]]
    gid <- names(genomes_list)[i]
    target_file <- file.path(normalizePath(tmp_out), paste0(gid, ".txt"))
    genome_to_libsvm(as.character(g), target_file, k = 8)
  }
})

# combine all libsvm files into a single train file (no test split)
# disable shuffling to preserve filename order and align labels
combined_paths <- split_and_combine_files(tmp_out, split = 1.0, overwrite = TRUE, shuffle = FALSE)
train_file <- combined_paths[["train"]]

# align labels to the order used in the combined train file
train_names <- combined_paths[["train_names"]]
# train_names are basenames with .txt; strip to get genome IDs
train_ids <- tools::file_path_sans_ext(train_names)
all_ids <- names(example_genomes$genomes)
id_index <- match(train_ids, all_ids)
labels <- example_genomes$labels[id_index]

# load combined libsvm file into xgboost DMatrix
dtrain <- xgboost::xgb.DMatrix(data = train_file, label = labels)

# run low-memory CV
cv <- xgb.cv.lowmem(data = dtrain,
                    params = list(eta = 0.05),
                    nrounds = 2000,
                    nfold = 3,
                    verbose = 0,
                    early_stopping_rounds = 5,
                    prediction = TRUE)

head(cv$evaluation_log)
#>   iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
#> 1    1        2.204962     0.08800962       2.224972     0.1681292
#> 2    2        2.109566     0.08404007       2.157574     0.1638208
#> 3    3        2.018659     0.08030880       2.097276     0.1612254
#> 4    4        1.932029     0.07677881       2.041901     0.1609919
#> 5    5        1.849487     0.07344387       1.992596     0.1607482
#> 6    6        1.770839     0.07031310       1.947898     0.1625535

# compare predictions to true labels
preds <- cv$pred
plot(labels, preds)

cor(preds, labels)
#> [1] 0.6772776

Contact

Report issues or feature requests on the GitHub repository: https://github.com/agerada/faLearn

Author

Alessandro Gerada

Name		Name	Last commit message	Last commit date
Latest commit History 294 Commits
.github		.github
R		R
data-raw		data-raw
data		data
inst/scripts		inst/scripts
man		man
renv		renv
revdep		revdep
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.Rprofile		.Rprofile
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
cran-comments.md		cran-comments.md
faLearn.Rproj		faLearn.Rproj
renv.lock		renv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

faLearn

Introduction

Main features

Installation

Quick examples

Contact

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

License

agerada/faLearn

Folders and files

Latest commit

History

Repository files navigation

faLearn

Introduction

Main features

Installation

Quick examples

Contact

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages