Given a set of sequences in a FASTA file, this function returns a sparse matrix of one-hot encoded sequences. In this matrix, the sequence features are along the rows and the sequences are along the columns. Currently, mono- and dinucleotide features for DNA sequences are supported. Because the DNA alphabet has four characters, the feature vector for each sequence has length 4 times the sequence length for mononucleotide features and 16 times the sequence length for dinucleotide features.
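To make the matrix layout concrete, here is a minimal base-R sketch of mononucleotide one-hot encoding. It is an illustration only, not the package's implementation: `one_hot_mononuc` is a hypothetical helper, it returns a dense integer matrix rather than a sparse one, and the ordering of features within a column (position by position, letters in the order A, C, G, T) is an assumption that may differ from what `prepare_data_from_FASTA` produces.

```r
# Hypothetical helper (not part of seqArchR): one-hot encode equal-length
# DNA sequences using mononucleotide features.
one_hot_mononuc <- function(seqs, alphabet = c("A", "C", "G", "T")) {
    seq_len_nt <- unique(nchar(seqs))
    stopifnot(length(seq_len_nt) == 1L)  # all sequences must share one length

    encode_one <- function(s) {
        chars <- strsplit(s, "", fixed = TRUE)[[1]]
        # Logical 4 x L matrix: rows are letters, columns are positions
        hits <- outer(alphabet, chars, FUN = "==")
        # Flatten column-wise into a vector of length 4 * L
        as.integer(hits)
    }

    # One column per sequence, one row per feature
    vapply(seqs, encode_one, integer(length(alphabet) * seq_len_nt),
           USE.NAMES = FALSE)
}

toy_seqs <- c("ACGT", "TTGA", "CCCC")
feat_mat <- one_hot_mononuc(toy_seqs)

dim(feat_mat)      # 16 features (4 letters x 4 positions) by 3 sequences
colSums(feat_mat)  # every position contributes exactly one 1
```

With three toy sequences of length 4, the result has 16 rows and 3 columns, matching the "features along rows, sequences along columns" layout described above. Most entries are zero, which is why a sparse representation suits real data.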

prepare_data_from_FASTA(fasta_fname, raw_seq = FALSE, sinuc_or_dinuc = "sinuc")

Arguments

fasta_fname

Provide the name (with complete path) of the input FASTA file.

raw_seq

Logical. Set this to TRUE to return the raw sequences (as a Biostrings::DNAStringSet object) instead of the one-hot encoded matrix. Default is FALSE.

sinuc_or_dinuc

Character string, either 'sinuc' or 'dinuc', to select mono- or dinucleotide profiles respectively. Default is 'sinuc'.

Value

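A sparse matrix of one-hot encoded sequences, with features along the rows and sequences along the columns. When raw_seq is TRUE, the sequences are returned as a Biostrings::DNAStringSet object instead.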

See also

get_one_hot_encoded_seqs for directly using a DNAStringSet object

Other input functions: get_one_hot_encoded_seqs()

Examples


fname <- system.file("extdata", "example_data.fa.gz",
                        package = "seqArchR", mustWork = TRUE)

# mononucleotide feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                        sinuc_or_dinuc = "sinuc")
#> Sequences OK, 
#> Read 200 sequences
#> Generating dinucleotide profiles

# dinucleotide feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                        sinuc_or_dinuc = "dinuc")
#> Sequences OK, 
#> Read 200 sequences
#> Generating dinucleotide profiles

# FASTA sequences as a Biostrings::DNAStringSet object
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                        raw_seq = TRUE)