Saturday 5 September 2020

A heterogeneous data frame in C++

 

Organise your data the R way

NEW! More material, now in a GitHub repo: https://github.com/ifknot/data_frame

I like R for statistics. The variables in R are lexically scoped and dynamically typed. 

I like C++ for just about everything else. C++ is a strongly typed language and it is also statically-typed; every object has a type and that type never changes.

I want to do some simple statistics in C++ but I can't imagine doing that without a heterogeneous data frame.

I want to be able to do what I do in R - desiderata:


But in C++ - ipsa:


It does this (unlike R, indexing in C++ begins from 0)


Here's how...

We need a C++ heterogeneously typed data frame - i.e. a table, a two-dimensional array-like structure, in which each column contains values of one variable and each row contains one set of values from each column - all indexed by column names and row numbers.

Since C++17 we can use std::variant. Rather than the traditional union that we might first think of using, the class template std::variant represents a type-safe union: it always knows which of its alternative types it currently holds.

So, we need a union of basic types similar to R.

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;

(I've chosen std::tm as a convenient data type for time and date but really we could do with a bespoke Date type, to mimic R, based on epoch that permits simple arithmetic.)
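To make "type-safe" concrete, here is a minimal sketch built on the basic_data_types alias above. The helpers held_type and reading_int_throws are hypothetical names invented for this illustration; they are not part of the repo.

```cpp
#include <ctime>
#include <string>
#include <variant>

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;

// Hypothetical helper: name the alternative a value currently holds.
std::string held_type(const basic_data_types& v) {
    if (std::holds_alternative<char>(v))        return "char";
    if (std::holds_alternative<double>(v))      return "double";
    if (std::holds_alternative<int>(v))         return "int";
    if (std::holds_alternative<std::string>(v)) return "string";
    return "tm";  // the only alternative left
}

// Hypothetical helper: true when reading the value as an int is rejected.
// A plain union would silently reinterpret the bytes; std::variant throws.
bool reading_int_throws(const basic_data_types& v) {
    try {
        (void)std::get<int>(v);
        return false;
    } catch (const std::bad_variant_access&) {
        return true;
    }
}
```

So a value built from 3.5 reports "double", and asking it for an int throws std::bad_variant_access instead of returning garbage.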

R uses vectors for data, and C++ has those too, but we need to mark semantically that ours is a heterogeneous vector of our basic types.

using variant_vector = std::vector<basic_data_types>;

Next, we need a table made up of columns of our variant_vector that has std::string column names and (zero-based, as is the C++ way) integer row numbers.

using data_frame = std::unordered_map<std::string, variant_vector>;
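With those three aliases, building a frame is just a matter of assigning columns by name. The sketch below assumes the aliases live in namespace R; the function make_employees and its column names and values are made up for illustration.

```cpp
#include <ctime>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

namespace R {
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;
}

// Hypothetical example data: three rows, three differently typed columns.
R::data_frame make_employees() {
    using namespace std::string_literals;  // enables the "..."s suffix
    R::data_frame d;
    d["id"]     = R::variant_vector{1, 2, 3};                      // int
    d["name"]   = R::variant_vector{"Rick"s, "Dan"s, "Michelle"s}; // std::string
    d["salary"] = R::variant_vector{623.3, 515.2, 611.0};          // double
    return d;
}
```

Nothing stops a single column from mixing types, or two columns from having different lengths; keeping each column homogeneous and equally long is a convention the caller has to maintain.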

To get the ball rolling for our desiderata let's also mimic the R as.Date function, ...

variant_vector as_dates(std::vector<std::string> dates);

... and the R print function - but of course using C++ idioms.

std::ostream& operator<<(std::ostream& os, const R::data_frame& df);
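For the as_dates declaration, one possible body is sketched below. It assumes every input string is in YYYY-MM-DD form (a format R's as.Date accepts by default) and that a malformed string should throw; both choices are assumptions made for this sketch rather than behaviour confirmed by the repo.

```cpp
#include <ctime>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

namespace R {
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;

// Sketch only: parse "YYYY-MM-DD" strings into std::tm cells.
variant_vector as_dates(std::vector<std::string> dates) {
    variant_vector result;
    result.reserve(dates.size());
    for (const auto& text : dates) {
        std::tm tm{};  // zero-initialise every field before parsing
        std::istringstream in(text);
        in >> std::get_time(&tm, "%Y-%m-%d");
        if (in.fail()) {
            throw std::invalid_argument("as_dates: cannot parse '" + text + "'");
        }
        result.emplace_back(tm);  // stored as the std::tm alternative
    }
    return result;
}
}  // namespace R
```

Remember that std::tm counts years from 1900 and months from 0, so "2013-09-23" is stored with tm_year 113 and tm_mon 8.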

(you can see that I have popped everything into a namespace called "R")
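A print operator along these lines could use std::visit to format each cell according to the type it holds. This is a rough sketch, not the repo's implementation: the helper print_cell, the tab-separated layout, the YYYY-MM-DD date format and the "NA" filler for short columns are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <ctime>
#include <iomanip>
#include <ostream>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

namespace R {
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical helper: write one cell, whatever type it currently holds.
void print_cell(std::ostream& os, const basic_data_types& cell) {
    std::visit([&os](const auto& x) {
        using T = std::decay_t<decltype(x)>;
        if constexpr (std::is_same_v<T, std::tm>) {
            os << std::put_time(&x, "%Y-%m-%d");  // dates need a format
        } else {
            os << x;  // char, double, int and std::string stream directly
        }
    }, cell);
}

std::ostream& operator<<(std::ostream& os, const data_frame& df) {
    // The row count is taken from the longest column.
    std::size_t rows = 0;
    for (const auto& entry : df) rows = std::max(rows, entry.second.size());

    // Header line: one tab-separated name per column.
    for (const auto& entry : df) os << '\t' << entry.first;
    os << '\n';

    // One line per row, prefixed with its zero-based row number.
    for (std::size_t r = 0; r < rows; ++r) {
        os << r;
        for (const auto& entry : df) {
            os << '\t';
            if (r < entry.second.size()) print_cell(os, entry.second[r]);
            else os << "NA";  // column shorter than the others
        }
        os << '\n';
    }
    return os;
}
}  // namespace R
```

Two caveats with this design. std::unordered_map has no defined iteration order, so columns print in an unspecified order. And because data_frame is only an alias for a std type, argument-dependent lookup will not search namespace R: outside R you need `using R::operator<<;` (or `using namespace R;`) before writing `std::cout << df`.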

Although it is easy to build a data frame, accessing the data is slightly more convoluted and requires prior knowledge of the column's data type.

auto money = std::get<double>(d["salary"][1]);
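That one-liner has two sharp edges: std::get throws std::bad_variant_access if the cell is not a double, and operator[] on the map quietly inserts an empty column when the name is misspelt, after which indexing row 1 is undefined behaviour. A more defensive accessor might look like the sketch below; cell_as is a hypothetical helper, not something from the repo.

```cpp
#include <cstddef>
#include <ctime>
#include <optional>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

namespace R {
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;
}

// Hypothetical helper: return the cell as T, or std::nullopt when the
// column is missing, the row is out of range, or the cell is not a T.
template <typename T>
std::optional<T> cell_as(const R::data_frame& df,
                         const std::string& column,
                         std::size_t row) {
    const auto it = df.find(column);  // find() never inserts
    if (it == df.end() || row >= it->second.size()) return std::nullopt;
    if (const T* value = std::get_if<T>(&it->second[row])) return *value;
    return std::nullopt;
}
```

Usage would then be something like `auto money = cell_as<double>(d, "salary", 1);` followed by a check of `money.has_value()` before reading `*money`.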

That's about it really; it works as expected and is useful.

TODO: 

It needs a CSV file reader to load tables in dynamically, but I will save that for another day...


Code:




photo credit: justgrimes data (scrabble) via photopin (license)

2 comments:

  1. This is an alternative approach to building a heterogeneous DataFrame in C++:
    https://github.com/hosseinmoein/DataFrame

  2. Thanks Hossein, very interesting - mine is less about the data frame and more about using R style idioms in C++ - I put it in a repo:

    https://github.com/ifknot/data_frame
