transGFM: Transfer Learning for Generalized Factor Models to obtain 'blessing of dimension'.

Overview

The transGFM package implements transfer learning methods for generalized factor models with support for continuous, count (Poisson), and binary data types. The package provides functions for single and multiple source transfer learning, source detection to identify positive and negative transfer sources, and information criteria for rank selection. The methods are particularly useful for high-dimensional data analysis where auxiliary information from related source datasets can improve estimation efficiency in the target domain.

The package is now available on CRAN under the name transGFM.

Installation

From CRAN (stable version)

install.packages("transGFM")

# Load the package
library(transGFM)

From GitHub (development version)

You can also install the development version of transGFM from GitHub:

# Install devtools if you haven't already
install.packages("devtools")

# Install transGFM from GitHub
devtools::install_github("zjwangATsu/transGFM")

# Load the package
library(transGFM)

Key Features

Single Source Transfer Learning: Transfer knowledge from one source dataset to improve target estimation
Multiple Source Transfer Learning: Leverage multiple related source datasets simultaneously
Source Detection: Automatically identify positive and negative transfer sources
Flexible Data Types: Support for continuous (Gaussian), count (Poisson), and binary data
Missing Data Handling: Automatically handles missing values in source and target datasets
Information Criteria: Built-in IC1 and IC2 criteria for selecting the number of factors
Cross-Validation: Automatic parameter selection via cross-validation

Main Functions

Core Functions

transGFM(): Single source transfer learning for generalized factor models
transGFM_multi(): Multiple source transfer learning with AD (Aggregation then Debiasing) and DA (Debiasing then Aggregation) methods

Source Detection Functions

source_potential(): Identify potential sources based on rank comparison using IC criterion
source_detection(): Detect positive/negative transfer sources using ratio-based method

Utility Functions

ic_criterion(): Information criterion (IC1/IC2) for selecting number of factors
relative_error(): Calculate relative Frobenius norm error between matrices
identify(): Factor decomposition using MLE (Maximum Likelihood Estimation)
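
As a rough illustration of what a relative Frobenius-norm error measures, the base-R sketch below divides the Frobenius norm of the difference by the norm of a reference matrix. The helper name `rel_frob_error` is hypothetical, and treating the second argument as the reference is an assumption; `relative_error()` in the package may normalize differently.

```r
# Hypothetical helper, not part of transGFM: relative Frobenius-norm error.
# Assumption: the second argument is the reference (true) matrix.
rel_frob_error <- function(M_hat, M_true) {
  stopifnot(all(dim(M_hat) == dim(M_true)))
  norm(M_hat - M_true, type = "F") / norm(M_true, type = "F")
}

M_true <- matrix(c(3, 0, 0, 4), 2, 2)   # Frobenius norm is 5
M_hat  <- M_true + diag(c(0.3, 0.4))    # perturbation with Frobenius norm 0.5
err <- rel_frob_error(M_hat, M_true)
print(err)
```

Values near 0 indicate that the estimate is close to the reference relative to the reference's overall scale.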

Supported Data Types

The package supports three types of data:

Data Type	Distribution	Use Case
`continuous`	Gaussian	Real-valued data
`count`	Poisson	Count data
`binary`	Binomial	Binary outcomes
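
The three data types can be made concrete with a small base-R simulation from a shared low-rank matrix `M`. This is only a sketch: the Poisson case mirrors the Quick Start examples, while the unit noise variance for continuous data and the logit link for binary data are assumptions made for illustration, not statements about the package internals.

```r
set.seed(1)
n <- 30; p <- 20; r <- 2

# Low-rank matrix M = F B^T shared by all three data types
F_mat <- matrix(runif(n * r, min = -1, max = 1), n, r)
B_mat <- matrix(runif(p * r, min = -1, max = 1), p, r)
M <- F_mat %*% t(B_mat)

# continuous: Gaussian with mean M (unit variance is an assumption)
X_cont <- M + matrix(rnorm(n * p), n, p)

# count: Poisson with rate exp(M), as in the Quick Start examples
X_count <- matrix(rpois(n * p, exp(M)), n, p)

# binary: Bernoulli with success probability plogis(M) (logit link is an assumption)
X_bin <- matrix(rbinom(n * p, size = 1, prob = plogis(M)), n, p)
```

Each matrix could then be passed with the matching `data_type` value ("continuous", "count", or "binary").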

Quick Start

Load the Package

library(transGFM)
set.seed(2025)

Example 1: Single Source Transfer Learning (Count Data)

# Generate Poisson data

# Source data (100 × 100 with 10% missing)
n1 <- 100; p1 <- 100; r <- 2
F_source <- matrix(runif(n1 * r, min = -2, max = 2), n1, r)
B_source <- matrix(runif(p1 * r, min = -2, max = 2), p1, r)
M_source <- F_source %*% t(B_source)
lambda_source <- exp(M_source)
X_source <- matrix(rpois(n1 * p1, as.vector(lambda_source)), n1, p1)

# Add 10% missing values to source
n_missing <- floor(n1 * p1 * 0.1)
missing_idx <- sample(n1 * p1, n_missing)
X_source[missing_idx] <- NA

# Target data (50 × 50, complete)
n0 <- 50; p0 <- 50
M_target_true <- M_source[1:n0, 1:p0]
lambda_target <- exp(M_target_true)
X_target <- matrix(rpois(n0 * p0, as.vector(lambda_target)), n0, p0)

# Run transGFM
result <- transGFM(
  source_data = X_source,
  target_data = X_target,
  r = 2,
  data_type = "count",
  lambda_seq = seq(0, 5, by = 1),
  K_cv = 3,
  verbose = FALSE
)

# Check results
print(paste("Optimal lambda:", result$optimal_lambda))
print(paste("Relative error:", relative_error(result$M_trans, M_target_true)))

Example 2: Multiple Source Transfer Learning

# Generate 3 source datasets
n1 <- 100; p1 <- 100; r <- 2
source_list <- list()

F_s <- matrix(runif(n1 * r, min = -2, max = 2), n1, r)
B_s <- matrix(runif(p1 * r, min = -2, max = 2), p1, r)
M_s <- F_s %*% t(B_s)

for (s in 1:3) {
  X_s <- matrix(rpois(n1 * p1, exp(M_s)), n1, p1)
  
  # Add missing values
  missing_rate <- 0.1 + (s - 1) * 0.02
  n_missing <- floor(n1 * p1 * missing_rate)
  missing_idx <- sample(n1 * p1, n_missing)
  X_s[missing_idx] <- NA
  
  source_list[[s]] <- X_s
}

# Target data
n0 <- 50; p0 <- 50
M_target_true <- M_s[1:n0, 1:p0]
X_target <- matrix(rpois(n0 * p0, exp(M_target_true)), n0, p0)

# Run transGFM_multi with AD method
result_AD <- transGFM_multi(
  source_data_list = source_list,
  target_data = X_target,
  r = 2,
  data_type = "count",
  method = "AD",
  lambda_seq = seq(0, 5, by = 1),
  K_cv = 3,
  verbose = FALSE
)

# Run transGFM_multi with DA method
result_DA <- transGFM_multi(
  source_data_list = source_list,
  target_data = X_target,
  r = 2,
  data_type = "count",
  method = "DA",
  verbose = FALSE
)

# Compare results
print(paste("AD method error:", relative_error(result_AD$M_trans, M_target_true)))
print(paste("DA method error:", relative_error(result_DA$M_trans, M_target_true)))

Example 3: Source Detection (Rank-Based)

# Generate 5 sources with different ranks
n1 <- 100; p1 <- 100
source_list <- list()

# Sources 1-2: rank 2 (same as target)
r_s <- 2
F_s <- matrix(runif(n1 * r_s, min = -2, max = 2), n1, r_s)
B_s <- matrix(runif(p1 * r_s, min = -2, max = 2), p1, r_s)
M_s <- F_s %*% t(B_s)

for (s in 1:2) {
  X_s <- matrix(rpois(n1 * p1, exp(M_s)), n1, p1)
  n_missing <- floor(n1 * p1 * 0.1)
  missing_idx <- sample(n1 * p1, n_missing)
  X_s[missing_idx] <- NA
  source_list[[s]] <- X_s
}

# Sources 3-5: rank 3 (different from target)
for (s in 3:5) {
  r_s_nega <- 3
  F_s_nega <- matrix(runif(n1 * r_s_nega, min = -2, max = 2), n1, r_s_nega)
  B_s_nega <- matrix(runif(p1 * r_s_nega, min = -2, max = 2), p1, r_s_nega)
  M_s_nega <- F_s_nega %*% t(B_s_nega)
  X_s_nega <- matrix(rpois(n1 * p1, exp(M_s_nega)), n1, p1)
  
  n_missing <- floor(n1 * p1 * 0.1)
  missing_idx <- sample(n1 * p1, n_missing)
  X_s_nega[missing_idx] <- NA
  
  source_list[[s]] <- X_s_nega
}

# Target data: rank 2
n0 <- 50; p0 <- 50
M_target <- M_s[1:n0, 1:p0]
X_target <- matrix(rpois(n0 * p0, exp(M_target)), n0, p0)

# Identify potential sources
result <- source_potential(
  X_sources = source_list,
  X0 = X_target,
  r_max = 5,
  ic_type = "IC1",
  data_type = "count",
  verbose = TRUE
)

print(result$positive_potential_sources)  # Should be c(1, 2)
print(result$negative_sources)            # Should be c(3, 4, 5)
print(result$r_target)                    # Should be 2
print(result$r_sources)                   # Should be c(2, 2, 3, 3, 3)
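
The rank comparison behind Example 3 can be sketched in a few lines of base R. The rule used here (a source is a positive candidate when its estimated rank equals the target's) is an assumption inferred from the expected outputs in the comments above; `source_potential()` may apply a different or more refined rule.

```r
# Estimated ranks as stated in the comments of Example 3
r_target  <- 2
r_sources <- c(2, 2, 3, 3, 3)

# Assumed rule: positive candidate when the source rank matches the target rank
positive_idx <- which(r_sources == r_target)
negative_idx <- which(r_sources != r_target)

print(positive_idx)
print(negative_idx)
```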

Example 4: Selecting Number of Factors

# Generate Poisson data with known rank
n <- 50; p <- 50; r_true <- 2

# Generate true factors
F_true <- matrix(runif(n * r_true, min = -1, max = 1), n, r_true)
B_true <- matrix(runif(p * r_true, min = -1, max = 1), p, r_true)
M_true <- F_true %*% t(B_true)

# Generate Poisson observations
lambda <- exp(M_true)
X <- matrix(rpois(n * p, as.vector(lambda)), n, p)

# Add 10% missing values
n_missing <- floor(n * p * 0.1)
missing_idx <- sample(n * p, n_missing)
X[missing_idx] <- NA

# Use IC1 to select rank
result_IC1 <- ic_criterion(
  X = X,
  r_max = 5,
  ic_type = "IC1",
  data_type = "count",
  verbose = FALSE
)

print(paste("True rank:", r_true))
print(paste("Estimated rank (IC1):", result_IC1$r_hat))

# Use IC2 to select rank
result_IC2 <- ic_criterion(
  X = X,
  r_max = 5,
  ic_type = "IC2",
  data_type = "count",
  verbose = FALSE
)

print(paste("Estimated rank (IC2):", result_IC2$r_hat))

Main Parameters

Data and Model Specification

source_data / source_data_list: Source dataset(s) (may contain NA for missing values)
target_data: Target dataset (n0 × p0 matrix)
r: Number of factors
data_type: Type of data - "continuous", "count", or "binary"

Transfer Learning Parameters

lambda_seq: Sequence of lambda values for cross-validation
K_cv: Number of folds for cross-validation (default: 5)
method: Fusion method for multiple sources - "AD" (Aggregation then Debiasing) or "DA" (Debiasing then Aggregation)
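
How the package splits data for cross-validation is not described here. One common approach for matrix-valued data is to assign observed entries to `K_cv` folds at random and mask the held-out fold as missing; the sketch below shows that idea in base R and is an assumption, not the package's implementation.

```r
set.seed(42)
X <- matrix(rpois(6 * 5, lambda = 3), 6, 5)
X[sample(length(X), 4)] <- NA            # a few entries are already missing

K_cv <- 3
obs_idx <- which(!is.na(X))              # linear indices of observed entries
fold_id <- sample(rep_len(seq_len(K_cv), length(obs_idx)))  # balanced random folds

# Training matrix for fold 1: its held-out entries are masked as NA
held_out <- obs_idx[fold_id == 1]
X_train <- X
X_train[held_out] <- NA
```

A candidate value from `lambda_seq` would then be scored by fitting on `X_train` and evaluating the fit on `X[held_out]`, repeating over all folds.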

Algorithm Parameters

C: CJMLE projection constant (if NULL, auto-calculated)
max_iter_cjmle: Maximum iterations for CJMLE (default: 30)
max_iter_refine: Maximum iterations for refinement (default: 30)
max_iter_nuclear: Maximum iterations for nuclear norm optimization (default: 100)
verbose: Print progress information (default: FALSE)
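
To give some intuition for a projection constant, the sketch below shrinks each row of a factor or loading matrix so that its Euclidean norm does not exceed `C`. This assumes `C` acts as a row-norm bound, as is typical in constrained joint likelihood estimation; the helper `project_rows` is hypothetical and the package's exact constraint may differ.

```r
# Hypothetical helper: shrink each row of A so its Euclidean norm is at most C.
project_rows <- function(A, C) {
  row_norms <- sqrt(rowSums(A^2))
  shrink <- pmin(1, C / row_norms)   # rows already inside the ball keep scale 1
  A * shrink                         # recycling multiplies row i by shrink[i]
}

A <- rbind(c(3, 4), c(0.3, 0.4), c(0, 0))
A_proj <- project_rows(A, C = 1)
print(sqrt(rowSums(A_proj^2)))       # row norms after projection
```

Rows that already satisfy the bound (including all-zero rows) are left unchanged, so the projection only affects estimates that drift too far from the origin.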

Information Criterion Parameters

r_max: Maximum number of factors to consider
ic_type: IC criterion type - "IC1" or "IC2"
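
The exact form of IC1 and IC2 used by `ic_criterion()` is not stated here. Purely as an illustration of how an information criterion trades fit against rank, the sketch below applies the classical Bai and Ng (2002) IC1 penalty to a Gaussian factor model fitted by truncated SVD. The Gaussian setting, the SVD fit, and the penalty form are all assumptions and may not match the package.

```r
set.seed(7)
n <- 50; p <- 50; r_true <- 2; r_max <- 5
F_mat <- matrix(runif(n * r_true, min = -2, max = 2), n, r_true)
B_mat <- matrix(runif(p * r_true, min = -2, max = 2), p, r_true)
X <- F_mat %*% t(B_mat) + matrix(rnorm(n * p), n, p)

sv <- svd(X)
ic_values <- sapply(seq_len(r_max), function(k) {
  # Rank-k least-squares fit via truncated SVD: U_k diag(d_k) V_k^T
  X_k <- sv$u[, 1:k, drop = FALSE] %*% (sv$d[1:k] * t(sv$v[, 1:k, drop = FALSE]))
  V_k <- mean((X - X_k)^2)                                  # mean squared residual
  penalty <- k * (n + p) / (n * p) * log(n * p / (n + p))   # Bai-Ng IC1 penalty
  # The Bai-Ng IC2 penalty would replace log(n * p / (n + p)) with log(min(n, p))
  log(V_k) + penalty
})
r_hat <- which.min(ic_values)
print(r_hat)
```

The selected rank is the candidate in `1:r_max` with the smallest criterion value, which is why `r_max` should be chosen comfortably above the plausible number of factors.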

Methodological Details

The transGFM package employs:

CJMLE Algorithm: Constrained joint maximum likelihood estimation (CJMLE) for factor models with missing data
Transfer Learning Framework:
- Single source: Direct transfer with cross-validated regularization
- Multiple sources: AD (aggregate then debias) or DA (debias then aggregate) methods
Source Detection:
- Rank-based: Compare estimated ranks using IC criterion
- Ratio-based: Use likelihood ratio to identify beneficial sources
Missing Data: Automatically handles missing values through indicator matrix approach
Convergence: Monitors likelihood changes until convergence
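
One way to read the indicator matrix approach is that the likelihood is summed over observed entries only. The base-R sketch below does this for Poisson data with rate `exp(M)`; the helper name `masked_pois_loglik` is hypothetical and the code is an illustration under that reading, not the package's internal routine.

```r
# Hypothetical helper: Poisson log-likelihood over observed entries only.
masked_pois_loglik <- function(X, M) {
  Omega <- !is.na(X)                        # indicator matrix: TRUE where observed
  sum(dpois(X[Omega], lambda = exp(M[Omega]), log = TRUE))
}

set.seed(3)
M <- matrix(runif(20, min = -1, max = 1), 4, 5)
X <- matrix(rpois(20, exp(M)), 4, 5)
ll_full <- masked_pois_loglik(X, M)

X_miss <- X
X_miss[c(2, 7)] <- NA                       # remove two entries
ll_miss <- masked_pois_loglik(X_miss, M)
```

Missing entries simply drop out of the sum, so no imputation is needed before evaluating or maximizing the likelihood.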

Model Characteristics

The transGFM package is specifically designed to handle:

High-dimensional data where the number of variables may exceed observations
Missing data in both source and target datasets
Multiple data types including continuous, count, and binary
Heterogeneous sources with different sample sizes and missing rates
Large feature dimensions, where the 'blessing of dimension' can be exploited

The transfer learning framework can significantly improve estimation accuracy when:

Source and target domains share similar latent factor structures
Source datasets have more observations or lower missing rates
Multiple complementary sources are available

Dependencies

R (≥ 3.5.0)
stats

Citation

The related papers are currently under review. We will update the citation information once they are published.

Authors

Zhijing Wang (Creator & Maintainer, SJTU)
Peirong Xu (SJTU)
Hongyu Zhao (Yale)
Tao Wang (SJTU)

Bug Reports and Issues

Please report any bugs or issues on the GitHub Issues page. When reporting, include:

A clear description of the issue
Reproducible code example
Your session information (sessionInfo())
Any relevant error messages

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss proposed modifications.

License

This package is licensed under GPL-3. See the LICENSE file for details.
