R/prepare_data_from_FASTA.R
prepare_data_from_FASTA.Rd
Given a set of sequences in a FASTA file, this function returns a sparse matrix of one-hot encoded sequences. In this matrix, the sequence features are along the rows and the sequences along the columns. Currently, mono- and dinucleotide features for DNA sequences are supported. Because the DNA alphabet has four characters, the length of the feature vector is 4 times the sequence length for mononucleotide features and 16 times the sequence length for dinucleotide features.
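The layout described above can be illustrated with a small base-R sketch for the mononucleotide case. This is not archR's implementation: `one_hot_mono` is a hypothetical helper, it returns a dense matrix rather than a sparse one, and the row ordering (four consecutive rows per sequence position, in the order A, C, G, T) is an assumption made only for illustration.

```r
# Hypothetical helper, not part of archR: one-hot encode equal-length DNA strings.
one_hot_mono <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  seq_len_nt <- unique(nchar(seqs))
  stopifnot(length(seq_len_nt) == 1)  # all sequences must share one length
  # One column per sequence, one row per (position, nucleotide) pair
  mat <- matrix(0L, nrow = length(alphabet) * seq_len_nt, ncol = length(seqs))
  pos <- seq_len(seq_len_nt)
  for (j in seq_along(seqs)) {
    chars <- strsplit(seqs[j], "")[[1]]
    idx <- match(chars, alphabet)  # 1..4 per position, NA for any other symbol
    ok <- !is.na(idx)
    # Assumed layout: the rows for position p are rows 4 * (p - 1) + 1 to 4 * p
    mat[(pos[ok] - 1L) * length(alphabet) + idx[ok], j] <- 1L
  }
  mat
}

seqs <- c("ACGT", "TTAG")
m <- one_hot_mono(seqs)
dim(m)      # 16 rows (4 x sequence length 4), 2 columns (one per sequence)
colSums(m)  # each sequence sets exactly one entry per position
```

With two sequences of length 4, the matrix has 4 x 4 = 16 rows and 2 columns, matching the "features along rows, sequences along columns" description.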
prepare_data_from_FASTA(fasta_fname, raw_seq = FALSE, sinuc_or_dinuc = "sinuc")
Argument | Description |
---|---|
fasta_fname | Name of the input FASTA file, including its complete path. |
raw_seq | TRUE or FALSE (default FALSE). Set this to TRUE if you want the raw sequences. |
sinuc_or_dinuc | Character string, 'sinuc' (default) or 'dinuc', to select mono- or dinucleotide profiles. |
A sparse matrix of sequences represented with one-hot encoding. When raw_seq is TRUE, the FASTA sequences are returned as a Biostrings::DNAStringSet object.
get_one_hot_encoded_seqs for directly using a DNAStringSet object
Other input functions: get_one_hot_encoded_seqs()
fname <- system.file("extdata", "example_data.fa", package = "archR", mustWork = TRUE)

# mononucleotides feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname, sinuc_or_dinuc = "sinuc")

# dinucleotides feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname, sinuc_or_dinuc = "dinuc")

# FASTA sequences as a Biostrings::DNAStringSet object
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname, raw_seq = TRUE)