Given a set of DNA sequences, archR enables unsupervised discovery of _de novo_ clusters with characteristic sequence architectures, defined by position-specific motifs or by the composition of stretches of nucleotides (e.g., CG-richness).

Call this function to process a data set using archR.

archR(
  config,
  seqs_ohe_mat,
  seqs_raw,
  seqs_pos = NULL,
  total_itr = NULL,
  set_ocollation = NULL,
  fresh = TRUE,
  use_oc = NULL,
  o_dir = NULL
)

Arguments

config

archR configuration object as returned by archR_set_config. This is a required argument.

seqs_ohe_mat

A matrix of one-hot encoded sequences with sequences along columns. This is a required argument.

seqs_raw

A DNAStringSet object. The FASTA sequences as a DNAStringSet object. This is a required argument.

seqs_pos

Vector. Specify the tick labels for sequence positions. Default is NULL.

total_itr

Numeric. Specify the number of iterations to perform. This should be greater than zero. Default is NULL.

set_ocollation

Logical vector. A logical vector of length `total_itr` specifying for every iteration of archR if collation of clusters from outer chunks should be performed. TRUE denotes clusters are collated, FALSE otherwise.

fresh

Logical. Specify if this is (not) a fresh run. Because archR enables checkpointing, it is possible to perform additional iterations upon clusters from an existing archR result (or a checkpoint) object. See 'use_oc' argument. For example, when processing a set of FASTA sequences, if an earlier call to archR performed two iterations, and now you wish to perform a third, the arguments `fresh` and `use_oc` can be used. Simply set `fresh` to FALSE and assign the sequence clusters from iteration two from the earlier result to `use_oc`. As of v0.1.3, with this setting, archR returns a new result object as if the additional iteration performed is the only iteration.

use_oc

List. Clusters to be further processed with archR. These can come from a previous archR result (in which case, use the get_seqs_clust_list function) or from any other method. Warning: This has not been rigorously tested yet (v0.1.3).

o_dir

Character. Specify the output directory with its path. archR will create this directory. If a directory with the given name already exists at the given location, archR adds a suffix to the directory name and reports this change to the user. Default is NULL. When NULL, only the result is returned; no plots, checkpoints, or results are written to disk.
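
To make the expected shape of `seqs_ohe_mat` concrete, here is a minimal base R sketch of one-hot encoding with sequences along columns. The helper `one_hot_encode` is hypothetical and not part of archR, and its row ordering (all positions for A, then C, G, T) is an assumption that may differ from the layout archR uses. In practice, the matrix is generated with `prepare_data_from_FASTA`, as in the Examples section.

```r
# Hypothetical helper, not part of archR: one-hot encode equal-length DNA
# sequences into a (4 * L) x N matrix with one sequence per column.
one_hot_encode <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  seq_lengths <- unique(nchar(seqs))
  stopifnot(length(seq_lengths) == 1)  # all sequences must share one length
  L <- seq_lengths
  vapply(seqs, function(s) {
    chars <- strsplit(s, "")[[1]]
    # Assumed layout: rows grouped by nucleotide (all A positions first,
    # then all C positions, and so on).
    as.numeric(unlist(lapply(alphabet, function(nt) chars == nt)))
  }, numeric(length(alphabet) * L), USE.NAMES = FALSE)
}

toy_seqs <- c("ACGT", "AACC", "TTGG")
ohe <- one_hot_encode(toy_seqs)
dim(ohe)      # 16 rows (4 * L, with L = 4) by 3 columns (sequences)
colSums(ohe)  # each column sums to L, one active nucleotide per position
```

Each column holds exactly one 1 per sequence position, so a column sum equal to the sequence length is a quick sanity check on any such matrix.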

Value

A nested list of elements as follows:

seqsClustLabels

A list with cluster labels for all sequences per iteration of archR. The cluster labels are stored as characters.

clustBasisVectors

A list with information on NMF basis vectors per iteration of archR. Per iteration, there are two variables: `nBasisVectors`, storing the number of basis vectors after model selection, and `basisVectors`, a matrix storing the basis vectors themselves. Dimensions of the `basisVectors` matrix are 4*L x nBasisVectors (mononucleotide case) or 16*L x nBasisVectors (dinucleotide case), where L is the sequence length.

clustSol

The clustering solution obtained upon processing the raw clusters from the last iteration of archR's result. This is handled internally by the function collate_archR_result using the default setting of Euclidean distance and ward.D linkage hierarchical clustering.

rawSeqs

The input sequences as a DNAStringSet object.

timeInfo

Stores the time taken (in minutes) for processing each iteration. This element is added only if `time` flag is set to TRUE in config.

config

The configuration used for processing.

call

The function call itself.
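
As a rough illustration of the `clustBasisVectors` dimensions and the clustering settings named under `clustSol`, the sketch below applies Euclidean distance and ward.D linkage to a simulated basis-vector matrix using base R's `stats` functions. The matrix is random stand-in data, the values of `L` and `K` are arbitrary, and this is not the internal code of collate_archR_result; it only shows what those two settings mean in plain R.

```r
set.seed(1)
L <- 10  # sequence length (toy value)
K <- 6   # number of basis vectors (toy value)

# Simulated stand-in for a mononucleotide basisVectors matrix:
# (4 * L) rows by K columns.
basis_vectors <- matrix(runif(4 * L * K), nrow = 4 * L, ncol = K)

# dist() compares rows, so transpose to treat each basis vector as one item.
d <- stats::dist(t(basis_vectors), method = "euclidean")
hc <- stats::hclust(d, method = "ward.D")

# Cut the tree into a chosen number of groups (2 is arbitrary here).
groups <- stats::cutree(hc, k = 2)
groups
```

The result assigns each of the `K` basis vectors (and hence each raw cluster) to one of the collated groups.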

Details

The archR package provides important functions in the following categories: data preparation and manipulation, non-negative matrix factorization, clustering, and visualization.

Functions for data preparation and manipulation

Functions for visualizations

Examples

fname <- system.file("extdata", "example_data.fa",
                     package = "archR", mustWork = TRUE)

# Specifying 'dinuc' generates dinucleotide features
inputSeqsMat <- archR::prepare_data_from_FASTA(fasta_fname = fname,
                                               sinuc_or_dinuc = "dinuc")
#> Sequences OK,
#> Read 200 sequences
#> Generating dinucleotide profiles
inputSeqsRaw <- archR::prepare_data_from_FASTA(fasta_fname = fname,
                                               raw_seq = TRUE)

# Set archR configuration
archRconfig <- archR::archR_set_config(
  parallelize = TRUE,
  n_cores = 2,
  n_runs = 100,
  k_min = 1,
  k_max = 20,
  mod_sel_type = "stability",
  bound = 10^-8,
  chunk_size = 100,
  flags = list(debug = FALSE, time = TRUE, verbose = TRUE, plot = FALSE)
)

# Run archR
archRresult <- archR::archR(config = archRconfig,
                            seqs_ohe_mat = inputSeqsMat,
                            seqs_raw = inputSeqsRaw,
                            seqs_pos = seq(1, 100, by = 1),
                            total_itr = 2,
                            set_ocollation = c(TRUE, FALSE))
#> ── Setting up ──────────────────────────────────────────────────────────────────
#> Parallelization: 2 cores
#> Model selection by factor stability
#> Bound: 1e-08
#>
#> ── Iteration 1 of 2 [1 chunk] ──────────────────────────────────────────────────
#>
#> ── Outer chunk 1 of 1 [Size: 200] ──
#>
#> ── Inner chunk 1 of 2 [Size: 100]
#> Checking K = 2
#> Checking K = 3
#> Checking K = 4
#> Best K for this chunk: 3
#> Adjusting for overfitting, fetched 2 clusters
#>
#> ── Inner chunk 2 of 2 [Size: 100]
#> Checking K = 2
#> Checking K = 3
#> Checking K = 4
#> Best K for this chunk: 3
#> 1 of 1 outer chunk complete
#> 1 of 2 iterations complete
#> → Iteration 1 completed: 47.8s
#> → Time ellapsed since start: 47.9s
#>
#> ── Iteration 2 of 2 [3 chunks] ─────────────────────────────────────────────────
#>
#> ── Outer chunk 1 of 3 [Size: 50] ──
#>
#> ── Inner chunk 1 of 1 [Size: 50]
#> Checking K = 2
#> Best K for this chunk: 1
#> 1 of 3 outer chunks complete
#>
#> ── Outer chunk 2 of 3 [Size: 55] ──
#>
#> ── Inner chunk 1 of 1 [Size: 55]
#> Checking K = 2
#> Checking K = 3
#> Best K for this chunk: 2
#> 2 of 3 outer chunks complete
#>
#> ── Outer chunk 3 of 3 [Size: 95] ──
#>
#> ── Inner chunk 1 of 1 [Size: 95]
#> Checking K = 2
#> Checking K = 3
#> Best K for this chunk: 2
#> 3 of 3 outer chunks complete
#> 2 of 2 iterations complete
#> → Iteration 2 completed: 39.6s
#> → Time ellapsed since start: 1m 27.5s
#> ── archR exiting 1m 27.5s ──────────────────────────────────────────────────────