cpp: Reading dumb data into the C++ heterogeneous data_frame

Ingest some CSV

Last month I presented an R-ish data_frame class as a small side project. This month I present a C++ equivalent of the R read.csv() function to import data into a data_frame.

Ongoing development is at the github repo: https://github.com/ifknot/rpp

To recap, the motivation for the C++ heterogeneous data_frame class was twofold:

Runtime handling of dumb data whose format, types, and fields are unknown.
Enable data science skill transfer from the R functional programming environment into C++.

The motivation for the read_csv() function remains the same as that for the heterogeneous data_frame class:

"I want to be able to do the same sort of thing that I do in R, but in C++".

Which, this time around, means that I want to be able to use one of the easiest and most reliable ways of getting data in - text files.

In particular, I mean CSV (comma-separated values) files. The CSV format uses commas to separate the fields in a record, and each record sits on its own line of the text file, which makes CSV files ideal for representing tabular data, i.e. the contents of a data_frame.

R comes with a healthy supply of inbuilt data sets to practice with, so for development purposes I have borrowed the mpg.csv dataset of miles per gallon (mpg) performance data for a range of makes and models of cars.

I want to be able to do what I do in R - desiderata:

But in C++ - ipsa:

It does this (unlike R, indexing in C++ begins from 0):

Unfortunately, not only is the tabulation not smart, the date is not read correctly either! In R this would be solved by converting the column with the as.Date() function. Therefore, a C++ equivalent of as.Date() will also be required.

Parsing the CSV input file

read_csv()

The R read.csv() function is a wrapper for read.table() that mandates a comma as the separator and uses the input file's first line as a header specifying the table's column names. However, until I develop this further, the C++ read_csv() function is implemented as a standalone function rather than as a wrapper.

1. Tokenization

At this stage the tokenizer is not yet equipped to handle dates or complex numbers:
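A minimal sketch of what such a tokenizer might look like. The names token_type, token, classify and tokenize_line are hypothetical and are not taken from the rpp repository. The sketch splits one line on commas and tags each field as an integer, a floating-point number, or text; quoted fields, dates and complex numbers are deliberately left out.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical token categories; dates and complex numbers are not handled.
enum class token_type { integer, real, text };

struct token {
    token_type type;
    std::string lexeme;  // the raw field text, kept for the evaluator stage
};

// A field is numeric only if the *whole* string converts without overflow.
// Note: strtod also accepts spellings such as "inf" and "nan".
inline token_type classify(const std::string& field) {
    if (field.empty()) return token_type::text;
    const char* const last = field.c_str() + field.size();
    char* end = nullptr;

    errno = 0;
    std::strtol(field.c_str(), &end, 10);
    if (errno == 0 && end == last) return token_type::integer;

    errno = 0;
    std::strtod(field.c_str(), &end);
    if (errno == 0 && end == last) return token_type::real;

    return token_type::text;
}

// Split one CSV line on commas and classify every field.
inline std::vector<token> tokenize_line(const std::string& line) {
    std::vector<token> tokens;
    std::size_t start = 0;
    while (true) {
        const std::size_t comma = line.find(',', start);
        const std::size_t length =
            (comma == std::string::npos) ? std::string::npos : comma - start;
        const std::string field = line.substr(start, length);
        tokens.push_back({classify(field), field});
        if (comma == std::string::npos) break;
        start = comma + 1;
    }
    return tokens;
}
```

Calling tokenize_line("audi,a4,1.8,1999,4") yields five tokens: two text, one real and two integer.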

2. Evaluator

The evaluator stage can be built into the read_csv() function body as a simple if and switch filter:
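As a rough illustration of the if and switch idea, the sketch below uses hypothetical stand-ins (cell, column, table) for the real data_frame and repeats a small classify helper so that it compiles on its own. It is not the repository's implementation: it treats every line as data, so a header row would simply be stored as text, and a trailing empty field is dropped.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the heterogeneous data_frame.
using cell = std::variant<long, double, std::string>;
using column = std::vector<cell>;
using table = std::vector<column>;  // column-major: table[col][row]

enum class token_type { integer, real, text };

// Numeric only if the whole field converts without overflow.
inline token_type classify(const std::string& field) {
    if (field.empty()) return token_type::text;
    const char* const last = field.c_str() + field.size();
    char* end = nullptr;
    errno = 0;
    std::strtol(field.c_str(), &end, 10);
    if (errno == 0 && end == last) return token_type::integer;
    errno = 0;
    std::strtod(field.c_str(), &end);
    if (errno == 0 && end == last) return token_type::real;
    return token_type::text;
}

// Read comma-separated lines and store each field with its detected type.
inline table read_csv(std::istream& in) {
    table columns;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // the "if" filter: skip blank lines
        std::istringstream fields(line);
        std::string field;
        for (std::size_t col = 0; std::getline(fields, field, ','); ++col) {
            if (col == columns.size()) columns.emplace_back();  // new column
            switch (classify(field)) {  // the "switch" filter: pick a type
            case token_type::integer:
                columns[col].emplace_back(std::stol(field));
                break;
            case token_type::real:
                columns[col].emplace_back(std::stod(field));
                break;
            case token_type::text:
                columns[col].emplace_back(field);
                break;
            }
        }
    }
    return columns;
}
```

Storing columns rather than rows keeps each vector close to what R calls a column vector, although nothing here forces all cells of a column to share one type.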

As yet, the read_csv() function does not implement the full functionality of the R read.csv() function, such as column names and column classes.

Printing the first part of a data_frame

head()

The C++ version of the R head() function is straightforward:
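One way such a function could be written, shown against a hypothetical column-major table of std::variant cells rather than the actual data_frame class. The default of six rows matches R's head(); the tab-separated layout and the 0-based row index are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the heterogeneous data_frame.
using cell = std::variant<long, double, std::string>;
using column = std::vector<cell>;
using table = std::vector<column>;  // column-major: table[col][row]

// Print the first n rows, one per line, prefixed with a 0-based row index.
inline void head(const table& df, std::size_t n = 6,
                 std::ostream& os = std::cout) {
    if (df.empty()) return;
    // Use the shortest column so a ragged table is never indexed out of range.
    std::size_t rows = df.front().size();
    for (const column& c : df) rows = std::min(rows, c.size());
    rows = std::min(rows, n);

    for (std::size_t r = 0; r < rows; ++r) {
        os << r;
        for (const column& c : df) {
            os << '\t';
            // std::visit dispatches to operator<< for whichever type is held.
            std::visit([&os](const auto& value) { os << value; }, c[r]);
        }
        os << '\n';
    }
}
```

Passing a std::ostringstream as the third argument captures the output instead of printing it, which is convenient for testing.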

Converting a column of integers or strings into an R-ish date type

Conversion to dates requires a specific date type. However, unlike R, where dates are represented as the number of days since 1970-01-01 (with negative values for earlier dates), I have chosen std::tm, a more intuitive structure that holds a calendar date and time broken down into its components.

r_date
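A minimal sketch of a date type built on std::tm. The member name parts and the helpers to_string and from_r_days are hypothetical, and the real r_date may be laid out quite differently. The from_r_days helper illustrates the contrast with R's days-since-1970 representation and assumes that std::time_t counts seconds since the Unix epoch, as it does on POSIX systems.

```cpp
#include <ctime>
#include <iomanip>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical wrapper around std::tm.
struct r_date {
    std::tm parts{};  // value-initialised, so every field starts at zero

    r_date() = default;

    // Takes a calendar year and a 1-12 month, unlike the raw std::tm fields,
    // which store years since 1900 and months 0-11.
    r_date(int year, int month, int day) {
        parts.tm_year = year - 1900;
        parts.tm_mon = month - 1;
        parts.tm_mday = day;
    }
};

// Stream a date in ISO 8601 form, e.g. 2020-10-17.
inline std::ostream& operator<<(std::ostream& os, const r_date& d) {
    return os << std::put_time(&d.parts, "%Y-%m-%d");
}

inline std::string to_string(const r_date& d) {
    std::ostringstream os;
    os << d;
    return os.str();
}

// Convert an R-style day count (days since 1970-01-01) to broken-down form.
// std::gmtime returns a pointer to shared static storage, so this is not
// thread-safe; if the conversion fails the fields are left zeroed.
inline r_date from_r_days(long days) {
    const std::time_t seconds = static_cast<std::time_t>(days) * 86400;
    r_date d;
    if (const std::tm* utc = std::gmtime(&seconds)) d.parts = *utc;
    return d;
}
```

With this in place, to_string(r_date(2020, 10, 17)) and to_string(from_r_days(18552)) both give "2020-10-17".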

as_dates()

With a date type in hand the C++ version of the R as.Date() function becomes:
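The conversion can be sketched with std::get_time, which parses text into a std::tm according to a format string. This version handles a column of strings only and returns plain std::tm values; the default format and the use of std::nullopt for entries that fail to parse (loosely analogous to R's NA) are assumptions of the sketch, not the author's design.

```cpp
#include <ctime>
#include <iomanip>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Parse every string in a column into a std::tm using the given format.
// Entries that do not match the format become std::nullopt.
inline std::vector<std::optional<std::tm>>
as_dates(const std::vector<std::string>& column,
         const std::string& format = "%Y-%m-%d") {
    std::vector<std::optional<std::tm>> dates;
    dates.reserve(column.size());
    for (const std::string& text : column) {
        std::tm parts{};  // zeroed, so unparsed fields (e.g. time) stay 0
        std::istringstream in(text);
        in >> std::get_time(&parts, format.c_str());
        if (in.fail()) {
            dates.push_back(std::nullopt);
        } else {
            dates.push_back(parts);
        }
    }
    return dates;
}
```

A column of integers would need its own overload, and what an integer means (a day count, a year, a packed yyyymmdd value) would have to be decided first.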

Example Usage

Bringing it all together:
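The overall flow (read, convert the date column, print the head) can be sketched in a compact, string-only form. Everything here is hypothetical: text_table, read_csv_text, normalise_dates and this head are illustrative names rather than the rpp API, and the day/month/year input format used in the usage note is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a data_frame: one vector of strings per column.
using text_table = std::vector<std::vector<std::string>>;

// Split CSV text into columns of raw strings (no quoting or header support).
inline text_table read_csv_text(const std::string& csv) {
    text_table columns;
    std::istringstream lines(csv);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::string field;
        for (std::size_t col = 0; std::getline(fields, field, ','); ++col) {
            if (col == columns.size()) columns.emplace_back();
            columns[col].push_back(field);
        }
    }
    return columns;
}

// Rewrite one column from the `from` format to ISO yyyy-mm-dd, leaving
// entries that fail to parse untouched.
inline void normalise_dates(std::vector<std::string>& column, const char* from) {
    for (std::string& text : column) {
        std::tm parts{};
        std::istringstream in(text);
        in >> std::get_time(&parts, from);
        if (in.fail()) continue;
        std::ostringstream out;
        out << std::put_time(&parts, "%Y-%m-%d");
        text = out.str();
    }
}

// Format the first n rows, tab separated, with a 0-based row index.
inline std::string head(const text_table& df, std::size_t n = 6) {
    std::ostringstream out;
    if (df.empty()) return out.str();
    std::size_t rows = df.front().size();
    for (const auto& c : df) rows = std::min(rows, c.size());
    rows = std::min(rows, n);
    for (std::size_t r = 0; r < rows; ++r) {
        out << r;
        for (const auto& c : df) out << '\t' << c[r];
        out << '\n';
    }
    return out.str();
}
```

For example, after reading "audi,17/10/2020" and calling normalise_dates on the second column with the format "%d/%m/%Y", head returns a row containing audi and 2020-10-17.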

Now the dates are converted and displayed properly, but there is still some smart tabulating to add at some point...

photo credit: justgrimes data (scrabble) via photopin (license)


Saturday, 17 October 2020