
Saturday, 17 October 2020

Reading dumb data into the C++ heterogeneous data_frame

 

Ingest some CSV

Last month I presented an R-ish data_frame class as a small side project; this month I present a C++ equivalent of the R read.csv() function to import data into a data_frame.

Ongoing development is at the github repo: https://github.com/ifknot/rpp

To recap, the motivation for the C++ heterogeneous data_frame class was two-fold:
  1. Runtime handling of dumb data whose format, types, and fields are unknown.
  2. Enable data science skill transfer from the R functional programming environment into C++.
The motivation for the read_csv() function remains the same as that for the heterogeneous data_frame class:

"I want to be able to do the same sort of thing that I do in R, but in C++".

Which, this time around, means that I want to be able to use one of the easiest and most reliable ways of getting data in - text files.

In particular, CSV (comma-separated values) files. The CSV format uses commas to separate the fields in a record, and each record sits on its own line of the text file, which makes CSV files ideal for representing tabular data, i.e. the data_frame class.
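The "one record per line, fields split on commas" idea can be sketched in a few lines. This is only an illustration: split_record is a hypothetical helper, not part of the rpp repo, and it assumes that no field contains an embedded comma or line break.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the rpp repo): split one CSV record
// into its raw fields on every delimiter.
// Assumption: fields never contain embedded delimiters or line breaks.
std::vector<std::string> split_record(const std::string& line, char delimiter = ',') {
    std::vector<std::string> fields;
    std::istringstream iss(line);
    std::string field;
    // std::getline with a delimiter extracts one field per call
    while (std::getline(iss, field, delimiter)) {
        fields.push_back(field);
    }
    return fields;
}
```

Quoted fields keep their quotes at this stage, and an empty field at the very end of a record is dropped by this loop, so a more complete reader would need extra handling for both cases.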

R comes with a healthy supply of inbuilt data sets to practice with, and so, for development purposes, I have borrowed the mpg.csv dataset of miles per gallon (mpg) performance data for a range of makes and models of cars.

I want to be able to do what I do in R - desiderata:
# Import the data and look at the first six rows
car_data <- read.csv(file = 'data/mpg.csv')
head(car_data)
manufacturer model hwy displ year cyl drv cty trans fl class
(str) (str) (int) (num) (date) (int) (str) (int) (str) (str) (str)
1 audi a4 29 1.8 1999 4 f 18 auto(l5) p compact
2 audi a4 29 1.8 1999 4 f 21 manual(m5) p compact
3 audi a4 31 2 2008 4 f 20 manual(m6) p compact
4 audi a4 30 2 2008 4 f 21 auto(av) p compact
5 audi a4 26 2.8 1999 6 f 16 auto(l5) p compact
6 audi a4 26 2.8 1999 6 f 18 manual(m5) p compact

But in C++ - ipsa:
#include <iostream>
#include "data_frame.h"
#include "read_csv.h"
#include "head.h"
using namespace R;
int main() {
std::cout << "read dumb data into heterogeneous container\n\n";
auto car_data = read_csv("mpg.csv");
std::cout << head(car_data);
}

It does this (unlike R, C++ indexing begins from 0):
read dumb data into heterogeneous container
manufacturer model hwy displ year cyl drv cty trans fl class
(str) (str) (int) (num) (int) (int) (str) (int) (str) (str) (str)
0 audi a4 29 1.8 1999 4 f 18 auto(l5) p compact
1 audi a4 29 1.8 1999 4 f 21 manual(m5) p compact
2 audi a4 31 2 2008 4 f 20 manual(m6) p compact
3 audi a4 30 2 2008 4 f 21 auto(av) p compact
4 audi a4 26 2.8 1999 6 f 16 auto(l5) p compact
5 audi a4 26 2.8 1999 6 f 18 manual(m5) p compact

Unfortunately, not only is the tabulation not smart, the date is not read correctly either: the year column comes back as an integer! This would be solved in R by converting the column using the R as.Date() function. Therefore, a C++ equivalent of the as.Date() function will also be required.

Parsing the CSV input file

read_csv()

The R read.csv() function is a wrapper function for read.table() that mandates a comma as separator and uses the input file's first line as a header that specifies the table's column names. However, until I develop this further, the C++ read_csv() function is implemented as a standalone function rather than as a wrapper.


1. Tokenization

At this stage the tokenizer is not, yet, equipped to handle dates or complex numbers:
#pragma once
#include <string>
namespace R {
/**
* The 7 R-ish data types:
* r_raw, r_integer, r_numeric, r_string, r_logical, r_complex, r_date
* converted to tokens with an extra 'broken' token for unrecognized type
*/
enum class token_t { raw_t, integer_t, numeric_t, string_t, logical_t, complex_t, date_t, broken_t };
/**
* parsing input streams requires the ability to recognize the limited R-ish PODs
* @note unrecognized lexemes return broken_t
*
* @param lexeme - the basic lexical unit to tokenize
*/
token_t tokenize(std::string& lexeme);
}

#include "tokenize.h"
#include <regex>
#include <stdexcept>
namespace R {
token_t tokenize(std::string& lexeme) {
if (lexeme == "true" || lexeme == "false") {
return token_t::logical_t;
}
// regex found on https://stackoverflow.com/questions/12643009/regular-expression-for-floating-point-numbers
if (std::regex_match(lexeme, std::regex("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"))) {
try {
// use a dirty trick to see if the lexeme format is a valid integer or double by trying to convert it
std::stod(lexeme); // try convert
// if get here then int or double
if (lexeme.find_first_not_of("+-0123456789") == std::string::npos) {
return token_t::integer_t;
}
else {
return token_t::numeric_t;
}
}
//otherwise catch the exception (by const reference) and tokenize accordingly
catch (const std::invalid_argument&) {
return token_t::broken_t;
}
catch (const std::out_of_range&) {
return token_t::broken_t;
}
}
// R is flexible with string (aka character) types but I will need to enforce quotes
if (lexeme[0] == '"' && lexeme.size() > 2 && lexeme[lexeme.size() - 1] == '"') {
return token_t::string_t;
}
// The best approximation for an R raw type is the C++ char
if (lexeme[0] == '\'' && lexeme.size() == 3 && lexeme[lexeme.size() - 1] == '\'') {
return token_t::raw_t;
}
// TODO: complex_t
// TODO: date_t
return token_t::broken_t;
}
}
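For the date_t TODO, one possible starting point is a simple shape check for dates written as YYYY-MM-DD. This is a sketch under stated assumptions rather than the repo's implementation: looks_like_iso_date is a hypothetical helper, and it only checks the shape and the month and day ranges.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not in the rpp repo): true when the lexeme has the
// shape YYYY-MM-DD with a month of 01-12 and a day of 01-31.
// Assumption: only the shape is checked, so an impossible calendar date
// such as 2020-02-31 still passes.
bool looks_like_iso_date(const std::string& lexeme) {
    static const std::regex iso_date(
        "[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])");
    return std::regex_match(lexeme, iso_date);
}
```

A bare year such as 1999 does not match this shape, so the year column of mpg.csv would still be tokenized as an integer and would still need an explicit conversion.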

2. Evaluator

The evaluator stage can be built into the read_csv() function body as a simple if and switch filter:
#pragma once
#include "data_frame.h"
namespace R {
// TODO: column names and classes
data_frame read_csv(std::string file_path, bool has_header = true);
}

#include "read_csv.h"
#include "tokenize.h"
#include <fstream>
#include <sstream>
#include <iostream>
namespace R {
data_frame read_csv(std::string file_path, bool has_header) {
data_frame d;
try {
std::ifstream file(file_path);
if (!file.is_open()) {
throw std::ifstream::failure(file_path);
}
file.exceptions (std::ifstream::badbit);
std::string line;
size_t nline{ 1 }, nfield{ 0 }; // current line number, current field number
std::vector<std::string> column;
if (has_header) {
std::getline(file, line);
std::istringstream iss(line);
std::string field;
while (getline(iss, field, ',')) {
if (tokenize(field) == token_t::string_t) {
field = field.substr(1, field.size() - 2); // chop enclosing sigils
column.push_back(field); // collect column name
d[field]; // construct empty column
}
else {
throw std::runtime_error(
"record " + std::to_string(nline) + " field " + std::to_string(nfield) + " malformed string " + field
);
}
nfield++;
}
nline++;
}
while (std::getline(file, line)) {
std::istringstream iss(line);
std::string field;
nfield = 0;
while (getline(iss, field, ',')) {
switch (tokenize(field)) {
case token_t::logical_t:
d[column[nfield]].push_back(field == "true");
break;
case token_t::integer_t:
d[column[nfield]].push_back(stoi(field));
break;
case token_t::numeric_t:
d[column[nfield]].push_back(stod(field));
break;
case token_t::complex_t:
// TODO: complex_t
break;
case token_t::date_t:
// TODO: date_t
break;
case token_t::string_t:
d[column[nfield]].push_back(field.substr(1, field.size() - 2));
break;
case token_t::raw_t:
d[column[nfield]].push_back(field.substr(1, field.size() - 2));
break;
case token_t::broken_t:
throw std::runtime_error(
file_path + " broken record on line " + std::to_string(nline) + " field " + std::to_string(nfield) + " : " + field
);
break;
}
nfield++;
}
nline++;
}
}
catch (const std::exception& e) {
std::cerr << e.what() << std::endl;
}
return d;
}
}

As yet, the read_csv() function does not implement the full functionality of the R read.csv() function, i.e. column names and classes.
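One of those gaps is visible in the listing: the column vector is only filled when has_header is true, so the data loop has no names to index by for a headerless file. R falls back to the names V1, V2, ... in that case. A minimal sketch of that fallback, where default_column_names is a hypothetical helper and not part of the rpp repo:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the rpp repo): build the names V1, V2, ..., Vn
// that R uses for a table read without a header line.
std::vector<std::string> default_column_names(std::size_t n) {
    std::vector<std::string> names;
    names.reserve(n);
    for (std::size_t i = 1; i <= n; ++i) {
        names.push_back("V" + std::to_string(i)); // R column names count from 1
    }
    return names;
}
```

The number of columns would have to come from the first data record, for example by counting its fields before any values are pushed.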

Printing the first part of a data_frame

head()

The C++ version of the R head() function is straightforward:
#pragma once
#include "data_structures.h"
namespace R {
/**
* Returns the first n items of a variant vector
*/
variant_vector head(const variant_vector& x, size_t n = 6);
/**
* Returns the first n rows of a data frame
*/
data_frame head(const data_frame& x, size_t n = 6);
}

#include "head.h"
namespace R {
variant_vector head(const variant_vector& x, size_t n) {
return variant_vector(x.begin(), (x.size() < n) ? x.end() : x.begin() + n); // honour n rather than a fixed 6
}
data_frame head(const data_frame& x, size_t n) {
data_frame df;
for (const auto& [key, vctr] : x) { // keys as column headings
df[key] = head(vctr, n);
}
return df;
}
}
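The clamping rule behind head() (take n items, or everything when fewer than n exist) can be shown on a plain std::vector. This is a standalone sketch: first_n is a hypothetical name, and std::vector stands in for the variant_vector type, which is assumed here to behave like a standard sequence container.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for head() on a single column: copy the first n
// items, or the whole vector when it holds fewer than n.
template <typename T>
std::vector<T> first_n(const std::vector<T>& x, std::size_t n = 6) {
    const std::size_t count = std::min(n, x.size()); // never step past end()
    return std::vector<T>(x.begin(),
                          x.begin() + static_cast<std::ptrdiff_t>(count));
}
```

Using std::min keeps the end iterator valid for any n, including 0 and values larger than the column length.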

Converting a column of integers or strings into an R-ish date type

Conversion to dates will require a specific date type. However, unlike R, where dates are represented as the number of days since 1970-01-01 with negative values for earlier dates, I have chosen std::tm, which is a more intuitive structure holding a calendar date and time broken down into its components.
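The two representations are related by simple calendar arithmetic. The sketch below turns the date fields of a std::tm into an R-style day count using the well-known days_from_civil algorithm for the proleptic Gregorian calendar. Both function names are hypothetical and nothing here is part of the rpp repo.

```cpp
#include <ctime>

// Hypothetical helper: build a std::tm holding only a calendar date.
std::tm make_date(int year, int month, int day) {
    std::tm tm{};
    tm.tm_year = year - 1900; // std::tm counts years from 1900
    tm.tm_mon = month - 1;    // std::tm counts months from 0
    tm.tm_mday = day;
    return tm;
}

// Hypothetical helper: days since 1970-01-01, negative for earlier dates.
// Assumption: tm holds a valid Gregorian date; time of day is ignored.
long days_since_epoch(const std::tm& tm) {
    int y = tm.tm_year + 1900;
    const unsigned m = static_cast<unsigned>(tm.tm_mon + 1);
    const unsigned d = static_cast<unsigned>(tm.tm_mday);
    if (m <= 2) --y; // treat January and February as the end of the previous year
    const long era = (y >= 0 ? y : y - 399) / 400;             // 400-year cycle
    const unsigned yoe = static_cast<unsigned>(y - era * 400); // year of era
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    return era * 146097 + static_cast<long>(doe) - 719468;
}
```

With this arithmetic 1970-01-01 maps to 0, 1969-12-31 to -1, and the date of this post, 2020-10-17, to 18552.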

r_date

#pragma once
#include <string>
#include <ctime>
#include <ostream>
namespace R {
/**
* An R-ish calendar date type
* This is the type to use if you have only dates, but no times, in your data.
* Default format is "%Y-%m-%d" e.g. 2020-09-30
* functions: as_dates, diffdates, +, -, etc
*/
struct r_date {
std::tm tm{};
std::string format{};
};
}
std::ostream& operator<<(std::ostream& os, const R::r_date& date);
bool operator == (const R::r_date& lhs, const R::r_date& rhs);
bool operator < (const R::r_date& lhs, const R::r_date& rhs);

#include "r_date.h"
#include <iomanip>
#include <cassert>
#include <algorithm>
std::ostream& operator<<(std::ostream& os, const R::r_date& date) {
os << std::put_time(&date.tm, date.format.c_str());
return os;
}
bool operator == (const R::r_date& lhs, const R::r_date& rhs) {
// std::mktime normalizes (and so may modify) its argument, so work on copies
std::tm a = lhs.tm;
std::tm b = rhs.tm;
auto t1 = std::mktime(&a);
auto t2 = std::mktime(&b);
// -1 if time cannot be represented as std::time_t
assert(t1 != -1 && t2 != -1);
return t1 == t2;
}
bool operator < (const R::r_date& lhs, const R::r_date& rhs) {
// std::mktime normalizes (and so may modify) its argument, so work on copies
std::tm a = lhs.tm;
std::tm b = rhs.tm;
auto t1 = std::mktime(&a);
auto t2 = std::mktime(&b);
// -1 if time cannot be represented as std::time_t
assert(t1 != -1 && t2 != -1);
return std::difftime(t2, t1) > 0;
}
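Because r_date is meant for dates without times, an alternative is to compare only the calendar fields. That also sidesteps a subtlety: std::mktime takes the time of day into account, so two values holding the same date but different times compare as unequal. The sketch below uses std::tie for a field-by-field comparison. The function names are hypothetical, and the fields are assumed to be already normalized, since no std::mktime call tidies them here.

```cpp
#include <ctime>
#include <tuple>

// Hypothetical helpers: compare two std::tm values by calendar date only.
// Assumption: tm_year, tm_mon and tm_mday are already in their normal ranges.
bool same_day(const std::tm& a, const std::tm& b) {
    return std::tie(a.tm_year, a.tm_mon, a.tm_mday) ==
           std::tie(b.tm_year, b.tm_mon, b.tm_mday);
}

bool earlier_day(const std::tm& a, const std::tm& b) {
    // tuples compare lexicographically: year first, then month, then day
    return std::tie(a.tm_year, a.tm_mon, a.tm_mday) <
           std::tie(b.tm_year, b.tm_mon, b.tm_mday);
}
```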

as_dates()

With a date type in hand the C++ version of the R as.Date() function becomes:
#pragma once
#include "data_structures.h"
namespace R {
/**
* @brief (R-ish) as_dates convert between string representations and objects of type r_date representing
* calendar dates.
*
* @param dates
* @param format - an override std::get_time compatible format string
*/
variant_vector as_dates(const variant_vector& dates, std::string format);
/**
* @brief (R-ish) as_dates convert between string representations and objects of type r_date representing
* calendar dates.
*
* @param dates
* @param format - an override std::get_time compatible format string
*/
variant_vector as_dates(variant_vector&& dates, std::string format);
}

#include "as_dates.h"
#include <iomanip>
#include <sstream>
#include <stdexcept>
namespace R {
variant_vector as_dates(const variant_vector& dates, std::string format) {
variant_vector tm_dates;
for (const auto& date : dates) {
r_string d;
switch (date.index()) {
case _str:
d = std::get<_str>(date);
break;
case _int:
d = std::to_string(std::get<_int>(date));
break;
default:
throw std::invalid_argument(std::string(__func__) + " invalid argument " + index_to_string[date.index()]);
}
std::istringstream ss(d);
std::tm tm;
std::time_t t = std::time(nullptr);
tm = *std::localtime(&t); // ensure no tm elements are undefined, or mktime will fail
ss >> std::get_time(&tm, format.c_str());
r_date tm_date;
tm_date.format = format;
tm_date.tm = tm;
tm_dates.push_back(tm_date);
}
return tm_dates;
}
variant_vector as_dates(variant_vector&& dates, std::string format) {
// a named rvalue reference is an lvalue, so this call binds to the const& overload above
const variant_vector& lvalue_dates = dates;
return as_dates(lvalue_dates, format);
}
}
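The listing does not check whether std::get_time actually succeeded, so a value that does not match the format would pass through silently. A minimal sketch of the same parse with a failure check, where parse_date is a hypothetical helper and not part of the rpp repo:

```cpp
#include <ctime>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse text into a std::tm using a std::get_time
// compatible format, throwing if the text does not match the format.
std::tm parse_date(const std::string& text, const std::string& format) {
    std::tm tm{};   // zero-initialize so no field is left undefined
    tm.tm_mday = 1; // day of month is 1-based, so a bare "%Y" stays a valid date
    std::istringstream ss(text);
    ss >> std::get_time(&tm, format.c_str());
    if (ss.fail()) { // std::get_time reports a mismatch through failbit
        throw std::invalid_argument(
            "cannot parse '" + text + "' with format '" + format + "'");
    }
    return tm;
}
```

Zero-initializing the structure is a different choice from the listing, which seeds it from the current local time. It keeps the unparsed fields deterministic, at the cost of always meaning midnight on the first valid day.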

Example Usage

Bringing it all together:
#include <iostream>
#include "data_frame.h"
#include "read_csv.h"
#include "head.h"
#include "as_dates.h"
using namespace R;
int main() {
std::cout << "read dumb data into heterogeneous container\n\n";
auto car_data = read_csv("mpg.csv");
try {
car_data["year"] = as_dates({ car_data["year"] }, "%Y");
}
catch (const std::exception& e) {
std::cerr << e.what() << "\n\n";
}
std::cout << head(car_data);
}

Now the dates are converted and displayed properly, but there is still some smart tabulating to add at some point...
read dumb data into heterogeneous container
manufacturer model hwy displ year cyl drv cty trans fl class
(str) (str) (int) (num) (date) (int) (str) (int) (str) (str) (str)
0 audi a4 29 1.8 1999 4 f 18 auto(l5) p compact
1 audi a4 29 1.8 1999 4 f 21 manual(m5) p compact
2 audi a4 31 2 2008 4 f 20 manual(m6) p compact
3 audi a4 30 2 2008 4 f 21 auto(av) p compact
4 audi a4 26 2.8 1999 6 f 16 auto(l5) p compact
5 audi a4 26 2.8 1999 6 f 18 manual(m5) p compact


photo credit: justgrimes data (scrabble) via photopin (license)



