Ingest some CSV
Last month I presented an R-ish data_frame class as a small side project; this month I present a C++ equivalent of the R read.csv() function to import data into a data_frame.
Ongoing development is at the GitHub repo: https://github.com/ifknot/rpp
To recap, the motivation for the C++ heterogeneous data_frame class was two-fold:
- Runtime handling of dumb data whose format, types, and fields are unknown.
- Enable data science skill transfer from the R functional programming environment into C++.
The motivation for the read_csv() function remains the same as that for the heterogeneous data_frame class:
"I want to be able to do the same sort of thing that I do in R, but in C++".
Which, this time around, means that I want to be able to use one of the easiest and most reliable ways of getting data in: text files.
In particular, CSV (comma-separated values) files. The CSV file format uses commas to separate the fields in a record, and each record sits on its own line in the text file, which makes CSV files ideal for representing tabular data, i.e. the data_frame class.
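To make the format concrete, here is a minimal sketch of splitting one CSV record into its fields. The name split_record is hypothetical (it is not part of the rpp repo), and the sketch assumes no quoted field contains the separator.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one CSV record on a single-character separator.
// Assumption: fields never contain the separator, even inside quotes.
std::vector<std::string> split_record(const std::string& line, char sep = ',') {
    std::vector<std::string> fields;
    std::istringstream iss(line);
    std::string field;
    while (std::getline(iss, field, sep)) {
        fields.push_back(field); // one entry per separated field
    }
    return fields;
}
```

Calling split_record("audi,a4,29,1.8") yields four fields. Note that with this approach a trailing empty field is dropped: "a,b," gives two fields, not three.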
R comes with a healthy supply of built-in data sets to practice with, and so, for development purposes, I have borrowed the mpg.csv dataset of miles per gallon (mpg) performance data for a range of makes and models of cars.
I want to be able to do what I do in R - desiderata:
# Import the data and look at the first six rows
car_data <- read.csv(file = 'data/mpg.csv')
head(car_data)

  manufacturer model hwy   displ year   cyl   drv   cty   trans      fl    class
  (str)        (str) (int) (num) (date) (int) (str) (int) (str)      (str) (str)
1 audi         a4    29    1.8   1999   4     f     18    auto(l5)   p     compact
2 audi         a4    29    1.8   1999   4     f     21    manual(m5) p     compact
3 audi         a4    31    2     2008   4     f     20    manual(m6) p     compact
4 audi         a4    30    2     2008   4     f     21    auto(av)   p     compact
5 audi         a4    26    2.8   1999   6     f     16    auto(l5)   p     compact
6 audi         a4    26    2.8   1999   6     f     18    manual(m5) p     compact
But in C++ - ipsa:
#include <iostream>

#include "data_frame.h"
#include "read_csv.h"
#include "head.h"

using namespace R;

int main() {
    std::cout << "read dumb data into heterogeneous container\n\n";
    auto car_data = read_csv("mpg.csv");
    std::cout << head(car_data);
}
It does this (unlike R, indexing in C++ begins from 0):
read dumb data into heterogeneous container

  manufacturer model hwy   displ year  cyl   drv   cty   trans      fl    class
  (str)        (str) (int) (num) (int) (int) (str) (int) (str)      (str) (str)
0 audi         a4    29    1.8   1999  4     f     18    auto(l5)   p     compact
1 audi         a4    29    1.8   1999  4     f     21    manual(m5) p     compact
2 audi         a4    31    2     2008  4     f     20    manual(m6) p     compact
3 audi         a4    30    2     2008  4     f     21    auto(av)   p     compact
4 audi         a4    26    2.8   1999  6     f     16    auto(l5)   p     compact
5 audi         a4    26    2.8   1999  6     f     18    manual(m5) p     compact
Unfortunately, not only is the tabulation not smart, the year column is read as an integer rather than a date! In R this would be solved by converting the column with the as.Date() function. Therefore, a C++ equivalent of the as.Date() function will also be required.
Parsing the CSV input file
read_csv()
The R read.csv() function is a wrapper function for read.table() that mandates a comma as the separator and uses the input file's first line as a header that specifies the table's column names. However, until I develop this further, the C++ read_csv() function is implemented in its own right rather than as a wrapper.
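For comparison, the wrapper arrangement that R uses could be sketched in C++ roughly as follows. Everything here is hypothetical and only illustrates the idea: raw_table, read_table and read_csv_raw are invented names, the fields are left as untyped strings, and none of it is the rpp implementation.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical untyped table: header names plus rows of raw string fields.
struct raw_table {
    std::vector<std::string> names;
    std::vector<std::vector<std::string>> rows;
};

// Hypothetical general reader: separator and header flag are parameters.
raw_table read_table(std::istream& in, char sep, bool header) {
    raw_table t;
    bool expect_header = header; // true until the first line is consumed
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> fields;
        std::istringstream iss(line);
        std::string field;
        while (std::getline(iss, field, sep)) {
            fields.push_back(field);
        }
        if (expect_header) {
            t.names = fields; // first line names the columns
            expect_header = false;
        } else {
            t.rows.push_back(fields);
        }
    }
    return t;
}

// The CSV flavour only fixes the separator and the header flag.
raw_table read_csv_raw(std::istream& in) {
    return read_table(in, ',', true);
}
```

The design point is simply that the CSV reader adds no parsing logic of its own; it pins down two arguments of the general reader.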
1. Tokenization
At this stage the tokenizer is not yet equipped to handle dates or complex numbers:
#pragma once

#include <string>

namespace R {

    /**
     * The 7 R-ish data types:
     * r_raw, r_integer, r_numeric, r_string, r_logical, r_complex, r_date
     * converted to tokens with an extra 'broken' token for unrecognized type
     */
    enum class token_t { raw_t, integer_t, numeric_t, string_t, logical_t, complex_t, date_t, broken_t };

    /**
     * parsing input streams requires the ability to recognize the limited R-ish PODs
     * @note unrecognized lexemes return broken_t
     *
     * @param lexeme - the basic lexical unit to tokenize
     */
    token_t tokenize(std::string& lexeme);

}
#include "tokenize.h"

#include <regex>
#include <stdexcept>

namespace R {

    token_t tokenize(std::string& lexeme) {
        if (lexeme == "true" || lexeme == "false") {
            return token_t::logical_t;
        }
        // regex found on https://stackoverflow.com/questions/12643009/regular-expression-for-floating-point-numbers
        // compiled once rather than on every call
        static const std::regex number("[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?");
        if (std::regex_match(lexeme, number)) {
            try {
                // use a dirty trick to see if the lexeme format is a valid integer or double by trying to convert it
                std::stod(lexeme); // try convert
                // if get here then int or double
                if (lexeme.find_first_not_of("+-0123456789") == std::string::npos) {
                    return token_t::integer_t;
                }
                else {
                    return token_t::numeric_t;
                }
            }
            // otherwise catch the exception (by const reference) and tokenize accordingly
            catch (const std::invalid_argument&) {
                return token_t::broken_t;
            }
            catch (const std::out_of_range&) {
                return token_t::broken_t;
            }
        }
        // R is flexible with string (aka character) types but I will need to enforce quotes
        // size() >= 2 so that an empty quoted string "" is still a string
        if (lexeme.size() >= 2 && lexeme[0] == '"' && lexeme[lexeme.size() - 1] == '"') {
            return token_t::string_t;
        }
        // The best approximation for an R raw type is the C++ char
        if (lexeme.size() == 3 && lexeme[0] == '\'' && lexeme[lexeme.size() - 1] == '\'') {
            return token_t::raw_t;
        }
        // TODO: complex_t
        // TODO: date_t
        return token_t::broken_t;
    }

}
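The integer-versus-numeric decision above relies on a regex plus a trial conversion that may throw. As a sketch of an alternative (a swapped-in technique, not what the tokenizer does), the C library's std::strtol and std::strtod report how far they parsed, so a lexeme can be classified by checking that the whole string was consumed. The names kind and classify_number are hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical three-way result, much smaller than the article's token_t.
enum class kind { integer, numeric, other };

// Classify a lexeme by checking that strtol / strtod consume all of it.
kind classify_number(const std::string& lexeme) {
    if (lexeme.empty()) {
        return kind::other; // nothing to parse
    }
    const char* begin = lexeme.c_str();
    const char* end_of_lexeme = begin + lexeme.size();
    char* end = nullptr;
    std::strtol(begin, &end, 10);
    if (end == end_of_lexeme) {
        return kind::integer; // every character was part of a base-10 integer
    }
    std::strtod(begin, &end);
    if (end == end_of_lexeme) {
        return kind::numeric; // every character was part of a floating-point literal
    }
    return kind::other;
}
```

This is more permissive than the regex: std::strtol and std::strtod skip leading whitespace, std::strtod also accepts forms such as "inf", "nan" and hexadecimal floats, and out-of-range values are not rejected here. Whether that is acceptable depends on how dirty the data is.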
2. Evaluator
The evaluator stage can be built into the read_csv() function body as a simple if and switch filter:
#pragma once

#include "data_frame.h"

namespace R {

    // TODO: column names and classes
    data_frame read_csv(std::string file_path, bool has_header = true);

}
#include "read_csv.h"
#include "tokenize.h"

#include <fstream>
#include <sstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace R {

    data_frame read_csv(std::string file_path, bool has_header) {
        data_frame d;
        try {
            std::ifstream file(file_path);
            if (!file.is_open()) {
                throw std::ifstream::failure(file_path);
            }
            file.exceptions(std::ifstream::badbit);
            std::string line;
            size_t nline{ 1 }, nfield{ 0 }; // current line number and field number
            std::vector<std::string> column; // column names in file order
            if (has_header) {
                std::getline(file, line);
                std::istringstream iss(line);
                std::string field;
                while (std::getline(iss, field, ',')) {
                    if (tokenize(field) == token_t::string_t) {
                        field = field.substr(1, field.size() - 2); // chop enclosing sigils
                        column.push_back(field); // collect column name
                        d[field]; // construct empty column
                    }
                    else {
                        throw std::runtime_error(
                            "record " + std::to_string(nline) + " field " + std::to_string(nfield) + " malformed string " + field
                        );
                    }
                    nfield++;
                }
                nline++;
            }
            while (std::getline(file, line)) {
                std::istringstream iss(line);
                std::string field;
                nfield = 0;
                while (std::getline(iss, field, ',')) {
                    // guard the column lookup: a record may not have more fields than named columns
                    if (nfield >= column.size()) {
                        throw std::runtime_error(
                            file_path + " too many fields on line " + std::to_string(nline)
                        );
                    }
                    switch (tokenize(field)) {
                    case token_t::logical_t:
                        d[column[nfield]].push_back(field == "true");
                        break;
                    case token_t::integer_t:
                        d[column[nfield]].push_back(std::stoi(field));
                        break;
                    case token_t::numeric_t:
                        d[column[nfield]].push_back(std::stod(field));
                        break;
                    case token_t::complex_t:
                        // TODO: complex_t
                        break;
                    case token_t::date_t:
                        // TODO: date_t
                        break; // do not fall through into the string case
                    case token_t::string_t:
                        d[column[nfield]].push_back(field.substr(1, field.size() - 2));
                        break;
                    case token_t::raw_t:
                        d[column[nfield]].push_back(field.substr(1, field.size() - 2));
                        break;
                    case token_t::broken_t:
                        throw std::runtime_error(
                            file_path + " broken record on line " + std::to_string(nline) + " field " + std::to_string(nfield) + " : " + field
                        );
                        break;
                    }
                    nfield++;
                }
                nline++;
            }
        }
        catch (const std::exception& e) {
            std::cerr << e.what() << std::endl;
        }
        return d;
    }

}
As yet, the read_csv() function does not implement the full functionality of the R read.csv() function, i.e. column names and classes.
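Stripped of the file handling, the evaluator step boils down to turning a (token, field) pair into a typed value. Here is a minimal C++17 sketch of just that step, assuming a simplified std::variant cell and a reduced token enum; both are hypothetical stand-ins for the real data_frame types and token_t.

```cpp
#include <stdexcept>
#include <string>
#include <variant>

// Hypothetical stand-ins for the article's variant type and token_t.
using cell = std::variant<bool, int, double, std::string>;
enum class token { logical, integer, numeric, string, broken };

// Convert one already-tokenized field into a typed cell.
cell evaluate(token t, const std::string& field) {
    switch (t) {
    case token::logical: return field == "true";
    case token::integer: return std::stoi(field);
    case token::numeric: return std::stod(field);
    case token::string:  return field.substr(1, field.size() - 2); // drop the quotes
    case token::broken:  break;
    }
    throw std::runtime_error("broken field: " + field);
}
```

For example, evaluate(token::string, "\"audi\"") holds the std::string audi, while evaluate(token::integer, "29") holds the int 29. Keeping this as a separate function would let the file-reading loop stay focused on line and field bookkeeping.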
Printing the first part of a data_frame
head()
The C++ version of the R head() function is straightforward:
#pragma once

#include "data_structures.h"

namespace R {

    /**
     * Returns the first n items of a variant vector
     */
    variant_vector head(const variant_vector& x, size_t n = 6);

    /**
     * Returns the first n rows of a data frame
     */
    data_frame head(const data_frame& x, size_t n = 6);

}
#include "head.h"

namespace R {

    variant_vector head(const variant_vector& x, size_t n) {
        // copy the first n items (not a hard-coded 6), or everything if x is shorter than n
        return variant_vector(x.begin(), (x.size() < n) ? x.end() : x.begin() + n);
    }

    data_frame head(const data_frame& x, size_t n) {
        data_frame df;
        for (const auto& [key, vctr] : x) { // keys as column headings
            df[key] = head(vctr, n);
        }
        return df;
    }

}
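The same "first n, or fewer if the vector is short" idea can be sketched on a plain std::vector. The template head_n is hypothetical, and it clamps the count with std::min instead of a conditional.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical generic version: copy at most n leading elements.
template <typename T>
std::vector<T> head_n(const std::vector<T>& x, std::size_t n = 6) {
    const std::size_t count = std::min(n, x.size()); // never read past the end
    return std::vector<T>(x.begin(), x.begin() + static_cast<std::ptrdiff_t>(count));
}
```

With an eight-element vector, head_n(v) returns six elements, head_n(v, 3) returns three, and head_n(v, 20) returns all eight.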
Converting a column of integers or strings into an R-ish date type
Conversion to dates will require a specific date type. However, unlike R, where dates are represented as the number of days since 1970-01-01 with negative values for earlier dates, I have chosen std::tm, which is a more intuitive structure holding a calendar date and time broken down into its components.
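The two representations are easy to bridge. As a sketch, the R-style day count can be computed from the broken-down fields of a std::tm with pure integer arithmetic; this uses Howard Hinnant's well-known days_from_civil algorithm for the proleptic Gregorian calendar. The wrapper days_from_tm is a hypothetical name, and it assumes the tm fields are already normalized.

```cpp
#include <ctime>

// Days since 1970-01-01 for a proleptic Gregorian date (negative before the epoch).
// m is 1..12 and d is 1..31; the year is shifted so that it starts in March.
long days_from_civil(int y, unsigned m, unsigned d) {
    y -= m <= 2;                                                     // Jan and Feb belong to the previous shifted year
    const long era = (y >= 0 ? y : y - 399) / 400;                   // 400-year cycle
    const unsigned yoe = static_cast<unsigned>(y - era * 400);       // year of era, 0..399
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; // day of shifted year, 0..365
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;      // day of era, 0..146096
    return era * 146097 + static_cast<long>(doe) - 719468;           // shift so 1970-01-01 is day 0
}

// Hypothetical adapter: std::tm stores years since 1900 and months from 0.
long days_from_tm(const std::tm& tm) {
    return days_from_civil(tm.tm_year + 1900,
                           static_cast<unsigned>(tm.tm_mon + 1),
                           static_cast<unsigned>(tm.tm_mday));
}
```

For example, days_from_civil(1970, 1, 1) is 0, days_from_civil(1969, 12, 31) is -1 and days_from_civil(2020, 9, 30) is 18535. No time zone is involved, unlike a round trip through std::mktime.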
r_date
#pragma once

#include <string>
#include <ctime>
#include <ostream>

namespace R {

    /**
     * An R-ish calendar date type
     * This is the type to use if you have only dates, but no times, in your data.
     * Default format is "%Y-%m-%d" e.g. 2020-09-30
     * functions: as_dates, diffdates, +, -, etc
     */
    struct r_date {
        std::tm tm{};
        std::string format{};
    };

}

std::ostream& operator<<(std::ostream& os, const R::r_date& date);

bool operator == (const R::r_date& lhs, const R::r_date& rhs);

bool operator < (const R::r_date& lhs, const R::r_date& rhs);
#include "r_date.h"

#include <iomanip>
#include <cassert>
#include <ctime>

std::ostream& operator<<(std::ostream& os, const R::r_date& date) {
    os << std::put_time(&date.tm, date.format.c_str());
    return os;
}

bool operator == (const R::r_date& lhs, const R::r_date& rhs) {
    // std::mktime normalizes its argument in place, so work on copies
    // instead of casting away const
    std::tm tm1 = lhs.tm;
    std::tm tm2 = rhs.tm;
    auto t1 = std::mktime(&tm1);
    auto t2 = std::mktime(&tm2);
    // -1 if time cannot be represented as std::time_t
    assert(t1 != -1 && t2 != -1);
    return t1 == t2;
}

bool operator < (const R::r_date& lhs, const R::r_date& rhs) {
    std::tm tm1 = lhs.tm;
    std::tm tm2 = rhs.tm;
    auto t1 = std::mktime(&tm1);
    auto t2 = std::mktime(&tm2);
    // -1 if time cannot be represented as std::time_t
    assert(t1 != -1 && t2 != -1);
    return std::difftime(t2, t1) > 0;
}
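Since r_date is meant for dates without times, another option is to compare the calendar fields directly and skip std::mktime altogether. The sketch below is an alternative technique, not the repo's implementation; same_day, earlier_day and make_day are hypothetical names, and it assumes the tm fields are already normalized (for example, no month 13).

```cpp
#include <ctime>
#include <tuple>

// Hypothetical helper: build a std::tm for a calendar day.
std::tm make_day(int year, int month, int day) {
    std::tm tm{};              // zero every field first
    tm.tm_year = year - 1900;  // std::tm counts years from 1900
    tm.tm_mon = month - 1;     // and months from 0
    tm.tm_mday = day;
    return tm;
}

// Equality on (year, month, day) only; time-of-day fields are ignored.
bool same_day(const std::tm& a, const std::tm& b) {
    return std::tie(a.tm_year, a.tm_mon, a.tm_mday) ==
           std::tie(b.tm_year, b.tm_mon, b.tm_mday);
}

// Lexicographic ordering on (year, month, day).
bool earlier_day(const std::tm& a, const std::tm& b) {
    return std::tie(a.tm_year, a.tm_mon, a.tm_mday) <
           std::tie(b.tm_year, b.tm_mon, b.tm_mday);
}
```

This avoids any dependence on the local time zone and on the range of std::time_t, at the cost of requiring normalized input.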
as_dates()
With a date type in hand the C++ version of the R as.Date() function becomes:
#pragma once

#include <string>

#include "data_structures.h"

namespace R {

    /**
     * @brief (R-ish) as_dates convert between string representations and objects of type r_date representing
     * calendar dates.
     *
     * @param dates
     * @param format - an override std::get_time compatible format string
     */
    variant_vector as_dates(const variant_vector& dates, std::string format);

    /**
     * @brief (R-ish) as_dates convert between string representations and objects of type r_date representing
     * calendar dates.
     *
     * @param dates
     * @param format - an override std::get_time compatible format string
     */
    variant_vector as_dates(variant_vector&& dates, std::string format);

}
#include "as_dates.h"

#include <ctime>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>

namespace R {

    variant_vector as_dates(const variant_vector& dates, std::string format) {
        variant_vector tm_dates;
        for (const auto& date : dates) {
            r_string d;
            switch (date.index()) {
            case _str:
                d = std::get<_str>(date);
                break;
            case _int:
                d = std::to_string(std::get<_int>(date));
                break;
            default:
                throw std::invalid_argument(std::string(__func__) + " invalid argument " + index_to_string[date.index()]);
            }
            std::istringstream ss(d);
            std::time_t t = std::time(nullptr);
            std::tm tm = *std::localtime(&t); // ensure no tm elements are undefined, or mktime will fail
            ss >> std::get_time(&tm, format.c_str());
            if (ss.fail()) { // the text did not match the format
                throw std::invalid_argument(std::string(__func__) + " cannot parse " + d + " as " + format);
            }
            r_date tm_date;
            tm_date.format = format;
            tm_date.tm = tm;
            tm_dates.push_back(tm_date);
        }
        return tm_dates;
    }

    variant_vector as_dates(variant_vector&& dates, std::string format) {
        // the conversion never steals from dates, so reuse the const& overload
        return as_dates(static_cast<const variant_vector&>(dates), format);
    }

}
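The core of the conversion is std::get_time reading from a string stream, and the stream's fail state is the only signal that the text did not match the format. A minimal standalone sketch of that step follows; parse_date is a hypothetical helper, separate from as_dates().

```cpp
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: returns true and fills `out` when `text` parses under `format`.
bool parse_date(const std::string& text, const std::string& format, std::tm& out) {
    std::tm tm{};                 // zero-initialize so unparsed fields are defined
    std::istringstream ss(text);
    ss >> std::get_time(&tm, format.c_str());
    if (ss.fail()) {
        return false;             // text did not match the format; `out` is untouched
    }
    out = tm;
    return true;
}
```

Parsing "2020-09-30" with "%Y-%m-%d" gives tm_year 120, tm_mon 8 and tm_mday 30. With a year-only format such as "%Y" the zero-initialized tm_mday stays 0, so a caller that later passes the result to std::mktime would probably want to set it to 1 first.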
Example Usage
Bringing it all together:
#include <iostream>

#include "data_frame.h"
#include "read_csv.h"
#include "head.h"
#include "as_dates.h"

using namespace R;

int main() {
    std::cout << "read dumb data into heterogeneous container\n\n";
    auto car_data = read_csv("mpg.csv");
    try {
        car_data["year"] = as_dates({ car_data["year"] }, "%Y");
    }
    catch (const std::exception& e) {
        std::cerr << e.what() << "\n\n";
    }
    std::cout << head(car_data);
}
read dumb data into heterogeneous container

  manufacturer model hwy   displ year   cyl   drv   cty   trans      fl    class
  (str)        (str) (int) (num) (date) (int) (str) (int) (str)      (str) (str)
0 audi         a4    29    1.8   1999   4     f     18    auto(l5)   p     compact
1 audi         a4    29    1.8   1999   4     f     21    manual(m5) p     compact
2 audi         a4    31    2     2008   4     f     20    manual(m6) p     compact
3 audi         a4    30    2     2008   4     f     21    auto(av)   p     compact
4 audi         a4    26    2.8   1999   6     f     16    auto(l5)   p     compact
5 audi         a4    26    2.8   1999   6     f     18    manual(m5) p     compact