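Problem description
Suppose I have a dynamic number of input strings (barcodes) read from a file. I want to split a huge 111 GB text file according to which input string each record matches, and write the hits to files.
I don't know in advance how many inputs there will be.
Ideally, I would open one file per entry in the barcode input vector, which contains only strings. Is there a way to open a dynamic number of output files?
A suboptimal approach would be to pass a single barcode string as an input argument and search for it, but that means reading the large file repeatedly.
The barcode input vector contains only strings, e.g. "TAGAGTAT", "TAGAGTAG", ...
Ideally, given the first two strings as input, the output would look like this:
file1 -> TAGAGTAT.txt
file2 -> TAGAGTAG.txt
Thanks for your help.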
extern crate needletail;
use needletail::{parse_fastx_file, Sequence, FastxReader};
use std::str;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;

fn read_barcodes() -> Vec<String> {
    // TODO: replace this with file-reading code (or move to an argument-based model
    // and parse/demultiplex only one oligomer at a time)
    // The `vec!` macro can be used to initialize a vector of strings
    let barcodes = vec![
        "TCTCAAAG".to_string(), "AACTCCGC".into(), "TAAACGCG".into(),
    ];
    println!("Initial vector: {:?}", barcodes);
    barcodes
}

fn main() {
    //let filename = "test5m.fastq";
    let filename = "Undetermined_S0_R1.fastq";
    println!("Fastq filename: {} ", filename);
    //println!("Barcodes filename: {} ", barcodes_filename);
    let barcodes_vector: Vec<String> = read_barcodes();
    let mut counts_vector: [i32; 30] = [0; 30];
    let mut n_bases = 0;
    let mut n_valid_kmers = 0;
    let mut reader = parse_fastx_file(&filename).expect("Not a valid path/file");
    while let Some(record) = reader.next() {
        let seqrec = record.expect("invalid record");
        // get sequence
        let sequence_bytes = seqrec.normalize(false);
        let sequence_text = str::from_utf8(&sequence_bytes).unwrap();
        //println!("Seq: {} ", &sequence_text);
        // get the first 8 bases (ASCII, so 8 bytes); panics if the read is shorter
        let sequence_oligo = &sequence_text[0..8];
        //println!("barcode vector {}, seqOligo {} ", &barcodes_vector[0], sequence_oligo);
        if sequence_oligo == barcodes_vector[0] {
            //println!("Hit! Barcode {}", sequence_oligo);
            counts_vector[0] += 1;
        }
    }
}
Solution
You probably want a HashMap<String, File>. You can build it from the barcode vector like this:
use std::collections::HashMap;
use std::fs::File;
use std::path::Path;
fn build_file_map(barcodes: &[String]) -> HashMap<String,File> {
let mut files = HashMap::new();
for barcode in barcodes {
let filename = Path::new(barcode).with_extension("txt");
let file = File::create(filename).expect("failed to create output file");
files.insert(barcode.clone(),file);
}
files
}
You can call it like this:
let barcodes = vec!["TCTCAAAG".to_string(),"AACTCCGC".into(),"TAAACGCG".into()];
let file_map = build_file_map(&barcodes);
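You can then get a file to write to like this:
// borrow the barcode; `barcodes[0]` by value would try to move the String out of the Vec
let barcode = &barcodes[0];
let file = file_map.get(barcode).expect("barcode not in file map");
// write to file (`Write` is implemented for `&File`, so binding this as
// `let mut file` and calling `file.write_all(...)` works on the shared reference)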
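To show how the map of files lets the big input be read only once, here is a minimal sketch using only the standard library. It is not the asker's needletail code: it assumes, for illustration, that each input line is one sequence, and the names `open_writers`, `demultiplex` and `BARCODE_LEN` are hypothetical. It also swaps the plain `File` for a `BufWriter<File>`, which is a choice made here to avoid one system call per written line.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufRead, BufWriter, Write};
use std::path::Path;

/// Barcode length taken from the question (first 8 bases of each read).
const BARCODE_LEN: usize = 8;

/// Open one buffered output file per barcode, named `<out_dir>/<barcode>.txt`.
fn open_writers(
    barcodes: &[String],
    out_dir: &Path,
) -> io::Result<HashMap<String, BufWriter<File>>> {
    let mut writers = HashMap::new();
    for barcode in barcodes {
        let path = out_dir.join(format!("{}.txt", barcode));
        writers.insert(barcode.clone(), BufWriter::new(File::create(path)?));
    }
    Ok(writers)
}

/// Read `input` once; every line whose first `BARCODE_LEN` bytes equal a known
/// barcode is written to that barcode's file. Returns the number of hits.
fn demultiplex<R: BufRead>(
    input: R,
    writers: &mut HashMap<String, BufWriter<File>>,
) -> io::Result<u64> {
    let mut hits = 0;
    for line in input.lines() {
        let line = line?;
        // `get` yields None for lines shorter than the barcode instead of panicking.
        if let Some(prefix) = line.get(..BARCODE_LEN) {
            if let Some(writer) = writers.get_mut(prefix) {
                writeln!(writer, "{}", line)?;
                hits += 1;
            }
        }
    }
    // Flush explicitly so write errors are reported rather than lost on drop.
    for writer in writers.values_mut() {
        writer.flush()?;
    }
    Ok(hits)
}

fn main() -> io::Result<()> {
    let barcodes = vec!["TCTCAAAG".to_string(), "AACTCCGC".to_string()];
    let out_dir = std::env::temp_dir();
    let mut writers = open_writers(&barcodes, &out_dir)?;
    // Stand-in for the large input file; any `BufRead` source works here.
    let reads = "TCTCAAAGGGTT\nAACTCCGCAAAA\nGGGGGGGGGGGG\nTCTCAAAGCCCC\n";
    let hits = demultiplex(reads.as_bytes(), &mut writers)?;
    println!("wrote {} matching reads", hits);
    Ok(())
}
```

With a real FASTQ file the same lookup would sit inside the record loop from the question, using the first 8 bytes of each record's sequence as the key. One thing to keep in mind is that every barcode holds an open file descriptor, so a very large barcode list may run into the operating system's open-file limit.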
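I just need an example of: a) how to correctly instantiate a vector of files named after the relevant strings, b) how to set up the output file objects correctly, c) how to write to those files.
Here is a commented example: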
use std::io::Write;
use std::fs::File;
use std::io;
fn read_barcodes() -> Vec<String> {
// read barcodes here
todo!()
}
fn process_barcode(barcode: &str) -> String {
// process barcodes here
todo!()
}
fn main() -> io::Result<()> {
let barcodes = read_barcodes();
for barcode in barcodes {
// process barcode to get output
let output = process_barcode(&barcode);
// create file for barcode with {barcode}.txt name
let mut file = File::create(format!("{}.txt",barcode))?;
// write output to created file
file.write_all(output.as_bytes())?;
}
Ok(())
}
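The `read_barcodes` stub above is left as `todo!()`. Since the question says the barcodes come from a file, one possible way to fill it in is sketched below. It assumes a plain-text format with one barcode per line, which the original post does not specify, and the name `parse_barcodes` is made up for this example.

```rust
use std::io::{self, BufRead};

/// Parse one barcode per line, trimming whitespace and skipping blank lines.
/// The one-barcode-per-line format is an assumption, not stated in the post.
fn parse_barcodes<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut barcodes = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let barcode = line.trim();
        if !barcode.is_empty() {
            barcodes.push(barcode.to_string());
        }
    }
    Ok(barcodes)
}

fn main() -> io::Result<()> {
    // In a real run the reader could be
    // `BufReader::new(File::open("barcodes.txt")?)` (hypothetical file name).
    let input = "TCTCAAAG\nAACTCCGC\n\nTAAACGCG\n";
    let barcodes = parse_barcodes(input.as_bytes())?;
    println!("{:?}", barcodes);
    Ok(())
}
```

Because the function takes any `BufRead`, the number of barcodes is only known at run time, and the resulting `Vec<String>` can be passed to the loop above unchanged.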