
Saturday 17 October 2020

Reading dumb data into the C++ heterogeneous data_frame

 

Ingest some CSV

Last month I presented an R-ish data_frame class as a small side project; this month I present a C++ equivalent of the R read.csv() function to import data into a data_frame.

Ongoing development is at the github repo: https://github.com/ifknot/rpp

To recap, the motivation for the C++ heterogeneous data_frame class was two-fold:
  1. Runtime handling of dumb data whose format, types, and fields are unknown.
  2. Enable data science skill transfer from the R functional programming environment into C++.
The motivation for the read_csv() function remains the same as that for the heterogeneous data_frame class:

"I want to be able to do the same sort of thing that I do in R, but in C++".

Which, this time around, means that I want to be able to use one of the easiest and most reliable ways of getting data in - text files.

In particular, CSV (comma-separated values) files. The CSV format uses commas to separate the fields in a record, and each record occupies its own line in the text file, which makes CSV files ideal for representing tabular data, i.e. the data_frame class.

R comes with a healthy supply of built-in data sets to practice with, so for development purposes I have borrowed the mpg.csv dataset of miles per gallon (mpg) performance data for a range of makes and models of cars.

I want to be able to do what I do in R - desiderata:

But in C++ - ipsa:

It does this (unlike R, indexing in C++ begins from 0)

Unfortunately, not only is the tabulation not smart, the date is not read correctly either! In R this would be solved by converting the column with the as.Date() function. Therefore, a C++ equivalent of as.Date() will also be required.

Parsing the CSV input file

read_csv()

The R read.csv() function is a wrapper for read.table() that defaults the separator to a comma and uses the input file's first line as a header specifying the table's column names. However, until I develop this further, the C++ read_csv() function is implemented as a standalone function rather than as a wrapper.
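To make the header convention concrete, here is a minimal sketch of it in C++. It is not the rpp implementation: raw_table, split_commas and read_csv_sketch are hypothetical names, every column is kept as plain strings, and the splitting is naive (no quoted fields).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: column names plus string-only columns.
struct raw_table {
    std::vector<std::string> names;
    std::vector<std::vector<std::string>> columns;
};

// Naive split on commas; quoted fields are NOT handled in this sketch.
inline std::vector<std::string> split_commas(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ',')) fields.push_back(field);
    // std::getline drops an empty field after a trailing comma, so restore it.
    if (!line.empty() && line.back() == ',') fields.emplace_back();
    return fields;
}

// Read a whole stream: the first line is the header, the rest are records.
inline raw_table read_csv_sketch(std::istream& in) {
    raw_table table;
    std::string line;
    auto chomp = [](std::string& s) {            // tolerate Windows line endings
        if (!s.empty() && s.back() == '\r') s.pop_back();
    };
    if (!std::getline(in, line)) return table;   // empty input, empty table
    chomp(line);
    table.names = split_commas(line);
    table.columns.resize(table.names.size());
    while (std::getline(in, line)) {
        chomp(line);
        if (line.empty()) continue;              // skip blank lines
        const auto fields = split_commas(line);
        // Short records are padded with empty strings; extra fields are ignored.
        for (std::size_t c = 0; c < table.columns.size(); ++c)
            table.columns[c].push_back(c < fields.size() ? fields[c] : std::string{});
    }
    return table;
}
```

Taking a std::istream rather than a file name keeps the sketch easy to try out: a std::ifstream opened on mpg.csv or a std::istringstream holding a couple of made-up rows both work.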


1. Tokenization

At this stage the tokenizer is not yet equipped to handle dates or complex numbers:
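As a rough illustration of this stage (a sketch under my own assumptions, not the code from the repo), a tokenizer can split a line into fields and tag each one as integer, real, or text. The names token_kind, token, classify and tokenize_line are hypothetical, and, in keeping with the limitation above, dates and complex numbers simply fall through as text.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories; the real tokenizer may use different ones.
enum class token_kind { integer, real, text };

struct token {
    token_kind kind;
    std::string lexeme;
};

// Classify an unquoted field: optional sign, digits, at most one decimal point.
inline token_kind classify(const std::string& s) {
    std::size_t i = (!s.empty() && (s[0] == '-' || s[0] == '+')) ? 1 : 0;
    bool seen_dot = false;
    bool seen_digit = false;
    for (; i < s.size(); ++i) {
        if (std::isdigit(static_cast<unsigned char>(s[i]))) {
            seen_digit = true;
        } else if (s[i] == '.' && !seen_dot) {
            seen_dot = true;
        } else {
            return token_kind::text;     // anything else, e.g. a date, is text
        }
    }
    if (!seen_digit) return token_kind::text;
    return seen_dot ? token_kind::real : token_kind::integer;
}

// Split one CSV line into tokens; commas inside double quotes do not split.
// Limitation: an escaped quote ("") inside a quoted field is dropped.
inline std::vector<token> tokenize_line(const std::string& line) {
    std::vector<token> tokens;
    std::string field;
    bool in_quotes = false;
    bool was_quoted = false;
    auto flush = [&]() {
        // Quoted fields are always text, even if they look numeric.
        const token_kind kind = was_quoted ? token_kind::text : classify(field);
        tokens.push_back(token{kind, field});
        field.clear();
        was_quoted = false;
    };
    for (char c : line) {
        if (c == '"') {
            in_quotes = !in_quotes;
            was_quoted = true;
        } else if (c == ',' && !in_quotes) {
            flush();
        } else {
            field += c;
        }
    }
    assert(!in_quotes);                  // sketch-level check: quotes must be closed
    flush();
    return tokens;
}
```

Tagging the fields here, rather than converting them, keeps the tokenizer free of any knowledge about how the data_frame stores its values.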


2. Evaluator

The evaluator stage can be built into the read_csv() function body as a simple if and switch filter:
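A minimal sketch of such a filter, assuming C++17 and a std::variant cell type (my assumption for illustration; token_kind, cell and evaluate are hypothetical names, not necessarily those used in rpp):

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical token categories produced by a tokenizer.
enum class token_kind { integer, real, text };

// Hypothetical heterogeneous cell: one of the types a column can hold.
using cell = std::variant<int, double, std::string>;

// Turn a tagged lexeme into a typed value.
inline cell evaluate(token_kind kind, const std::string& lexeme) {
    // The "if": an empty field has nothing to convert, so keep it as empty text.
    if (lexeme.empty()) return std::string{};
    // The "switch": pick the conversion that matches the token's tag.
    switch (kind) {
        case token_kind::integer: return std::stoi(lexeme);
        case token_kind::real:    return std::stod(lexeme);
        case token_kind::text:    break;
    }
    return lexeme;
}
```

Note that std::stoi and std::stod throw on out-of-range input, which a fuller version would need to catch or avoid.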

As yet, the read_csv() function does not implement the full functionality of the R read.csv() function, i.e. column names and classes.

Printing the first part of a data_frame

head()

The C++ version of the R head() function is straightforward:
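For illustration, a head() along these lines might look like the sketch below. It runs against a hypothetical toy_frame (named columns of std::variant cells, C++17) rather than the real data_frame class, labels rows from 0, and borrows R's default of six rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the real data_frame: named columns of variant cells.
using cell = std::variant<int, double, std::string>;

struct column {
    std::string name;
    std::vector<cell> values;
};

using toy_frame = std::vector<column>;

// Format the header and the first n rows as tab-separated text.
inline std::string format_head(const toy_frame& df, std::size_t n = 6) {
    std::ostringstream os;
    if (df.empty()) return os.str();
    const std::size_t rows = std::min(n, df.front().values.size());
    for (const auto& col : df) {
        assert(col.values.size() == df.front().values.size()); // columns share one length
        os << '\t' << col.name;
    }
    os << '\n';
    for (std::size_t r = 0; r < rows; ++r) {
        os << r;                                   // row label, starting at 0
        for (const auto& col : df) {
            os << '\t';
            // Visit the variant so each cell prints with its own type.
            std::visit([&os](const auto& v) { os << v; }, col.values[r]);
        }
        os << '\n';
    }
    return os.str();
}

// Print the first n rows to standard output.
inline void head(const toy_frame& df, std::size_t n = 6) {
    std::cout << format_head(df, n);
}
```

Separating format_head() from head() is only a convenience for checking the output; plain tab separation is also why the tabulation is "not smart".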


Converting a column of integers or strings into an R-ish date type

Conversion to dates requires a specific date type. However, unlike R, where dates are represented as the number of days since 1970-01-01, with negative values for earlier dates, I have chosen std::tm, a more intuitive structure holding a calendar date and time broken down into its components.

r_date
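A minimal sketch of what a date type built on std::tm could look like. The wrapper layout and the helpers make_date and to_string are my assumptions for illustration, not the definition in the repo. The main thing to remember with std::tm is that tm_year counts years since 1900 and tm_mon counts months from 0.

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical stand-in for r_date: a thin wrapper over std::tm.
struct r_date {
    std::tm tm{};   // value-initialised, so every field starts at zero
};

// Build a date from ordinary calendar components (month 1..12, day 1..31).
inline r_date make_date(int year, int month, int day) {
    assert(month >= 1 && month <= 12);
    assert(day >= 1 && day <= 31);
    r_date d;
    d.tm.tm_year = year - 1900;   // std::tm stores years since 1900
    d.tm.tm_mon = month - 1;      // std::tm stores months from 0
    d.tm.tm_mday = day;
    return d;
}

// Stream a date in the ISO 8601 style that R prints by default, e.g. 2020-10-17.
inline std::ostream& operator<<(std::ostream& os, const r_date& d) {
    return os << std::put_time(&d.tm, "%Y-%m-%d");
}

inline std::string to_string(const r_date& d) {
    std::ostringstream os;
    os << d;
    return os.str();
}
```

Wrapping std::tm in its own type, instead of using it directly, gives the date a distinct type that can carry its own stream operator.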



as_dates()

With a date type in hand the C++ version of the R as.Date() function becomes:
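One way to sketch the conversion is with std::get_time from the standard library. The signatures below are assumptions for illustration (C++17, plain std::tm rather than a wrapper type), with std::nullopt playing the role that NA plays in R when a string cannot be parsed.

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Parse one string with a strftime-style format; nullopt if it does not match.
inline std::optional<std::tm> parse_date(const std::string& text,
                                         const char* format = "%Y-%m-%d") {
    assert(format != nullptr);
    std::tm tm{};                          // zero every field before parsing
    std::istringstream in(text);
    in >> std::get_time(&tm, format);
    if (in.fail()) return std::nullopt;    // not a date in this format
    return tm;
}

// Convert a whole column of strings, keeping one result per input element.
inline std::vector<std::optional<std::tm>> as_dates(
        const std::vector<std::string>& column,
        const char* format = "%Y-%m-%d") {
    std::vector<std::optional<std::tm>> dates;
    dates.reserve(column.size());
    for (const auto& text : column) dates.push_back(parse_date(text, format));
    return dates;
}
```

The format argument mirrors the format parameter of as.Date() in R, so a column written as day/month/year could be converted by passing "%d/%m/%Y".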


Example Usage

Bringing it all together:

Now the dates are converted and displayed properly, but there is still some smart tabulating to add at some point...


photo credit: justgrimes data (scrabble) via photopin (license)



