R/prepare_data_from_FASTA.R
prepare_data_from_FASTA.Rd
Given a set of sequences in a FASTA file, this function returns a sparse matrix of one-hot encoded sequences. In this matrix, sequence features run along the rows and sequences along the columns. Currently, mono- and dinucleotide features for DNA sequences are supported. Because the DNA alphabet has four characters, the feature vector is 4 times the sequence length for mononucleotide features and 16 times the sequence length for dinucleotide features.
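To make the matrix layout concrete, here is a minimal base-R sketch of mononucleotide one-hot encoding. It is not the package's implementation: the helper name one_hot_mono, the A/C/G/T ordering, and the position-by-position row layout are assumptions made for illustration, and it builds a dense matrix rather than a sparse one.

```r
# Hypothetical helper, not part of seqArchR: one-hot encode equal-length DNA
# strings into a dense (4 * L) x N integer matrix, with features along rows
# and sequences along columns.
one_hot_mono <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  seq_len_nt <- unique(nchar(seqs))
  stopifnot(length(seq_len_nt) == 1L)  # all sequences must share one length

  encode_one <- function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    # Assumed layout: one block of length(alphabet) entries per position,
    # with a single 1 marking the nucleotide observed at that position.
    hits <- vapply(chars, function(ch) alphabet == ch,
                   logical(length(alphabet)))
    as.integer(hits)
  }

  vapply(seqs, encode_one, integer(length(alphabet) * seq_len_nt),
         USE.NAMES = FALSE)
}

# Two sequences of length 4 give a 16 x 2 matrix (4 features per position).
m <- one_hot_mono(c("ACGT", "TTGA"))
dim(m)
```

Each column holds exactly one 1 per sequence position, so colSums(m) equals the sequence length for every sequence.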
prepare_data_from_FASTA(fasta_fname, raw_seq = FALSE, sinuc_or_dinuc = "sinuc")
fasta_fname: Name (with complete path) of the input FASTA file.
raw_seq: TRUE or FALSE. Set this to TRUE if you want the raw sequences.
sinuc_or_dinuc: Character string, 'sinuc' or 'dinuc', to select mono- or dinucleotide profiles.
A sparse matrix of sequences represented with one-hot encoding. When raw_seq is TRUE, the FASTA sequences are returned as a Biostrings::DNAStringSet object.
get_one_hot_encoded_seqs for directly using a DNAStringSet object
Other input functions:
get_one_hot_encoded_seqs()
fname <- system.file("extdata", "example_data.fa.gz",
                     package = "seqArchR", mustWork = TRUE)
# mononucleotides feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   sinuc_or_dinuc = "sinuc")
#> Sequences OK,
#> Read 200 sequences
#> Generating dinucleotide profiles
# dinucleotides feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   sinuc_or_dinuc = "dinuc")
#> Sequences OK,
#> Read 200 sequences
#> Generating dinucleotide profiles
# FASTA sequences as a Biostrings::DNAStringSet object
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   raw_seq = TRUE)