Speed read counts from RNA-Seq

Author: Dr Chibisi Chima-Okereke Created: December 7, 2014 14:21:00 GMT Published: December 7, 2014 14:21:00 GMT

Some time ago, we created a blog about differential expression in drosophilla. Part of the analysis requires reading text files containing count data. This can be done using R’s read.delim function. In this blog entry we present a function readFileOfCounts that is an alternative method for reading files in this format. It is written in C++ using the Rcpp package and there is an R wrapper function that gives access to the functionality. It reads files in about a third of the time that the native R function takes.

Note that the function is designed to read the count file data where the first column represents the symbol column and the other columns contain count data. The link to the code in this article is given in our GitHub page.

The C++ Code

The code presented in this section is the C++ code that actually reads the contents of the file. The output is a list containing a matrix of the numerical data and the column names and rownames of the table.

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <RcppArmadillo.h>
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <sstream>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export(".readFileOfCounts")]]
SEXP readFileOfCounts(std::string filePath, std::string sep) {
    
  ifstream input(filePath);
  string lineStr; // string for the line
  string colStr; // string for the columns
  stringstream strStream {};
  const char * colChar;
  
  int iCol, iRow = 0; // to capture the current column
  // The separator
  char _sep;
  const char * tsep;
  tsep = sep.c_str();
  if(tsep[0] == '\\'){
    _sep = '\t';
    }else{
    _sep = tsep[0];
  }
  
  vector<string> colNames; // column names
  
  // Read the first line into a string vector
  getline(input, lineStr);
  strStream.str(lineStr);
  while(getline(strStream, colStr, _sep))
  {
    colNames.push_back(colStr);
  }
  strStream.clear();
  vector<string> rowNames; // row names
  vector<vector<double>> countData;
  vector<double> rowData;
  iCol = colNames.size() - 1;
  double countValue;
  // Now we read the rest of the file
  while(getline(input, lineStr))
  {
    strStream.str(lineStr);
    getline(strStream, colStr, _sep); // read the rowname name
    rowNames.push_back(colStr);
    while(getline(strStream, colStr, _sep))
    {
      colChar = colStr.c_str();
      countValue = atof(colChar);
      rowData.push_back(countValue);
    }
    countData.push_back(rowData);
    strStream.clear();
    rowData.clear();
    ++iRow;
  }
  
  NumericMatrix rMat(iRow, iCol);
  for(int i = 0; i < iRow; ++i){
    rowData = countData[i];
    for(int j = 0; j < iCol; ++j){
      rMat(i, j) = rowData[j];
    }
    rowData.clear();
  }
  
  CharacterVector _rowNames = wrap(rowNames);
  CharacterVector _colNames = wrap(colNames);
  
  List output; output["mat"] = rMat; output["rowNames"] = _rowNames;
  output["colNames"] = _colNames;
  return wrap(output);
}

The R wrapper function

The R wrapper function contains takes the output list from the C++ function and constructs a data frame from the matrix data and row and column names.

readFileOfCounts <- function(filePath, sep){
  cout <- .readFileOfCounts(filePath, sep)
  output <- cout[["mat"]]
  dimnames(output) <- list(cout[["rowNames"]], cout[["colNames"]][-1])
  return(as.data.frame(output))
}

Usage

The usage of these functions is given below:

require(Rcpp)
sourceCpp("readFileOfCounts.cpp")
source("readFileOfCounts.R")
mb <- microbenchmark::microbenchmark
benchmarks <- mb(readFileOfCounts("count_table.txt", sep = "\t"), read.delim("count_table.txt", row.names="gene"))
Unit: milliseconds
                                              expr     min       lq     mean   median       uq      max neval
  readFileOfCounts("count_table.txt", sep = "\\t") 254.419 284.8199 292.3839 288.3123 293.8638 408.4934   100
 read.delim("count_table.txt", row.names = "gene") 798.816 817.1757 838.4695 831.1102 862.1574 895.1991   100