Organise your data the R way
NEW! - more stuff, now in a GitHub repo - https://github.com/ifknot/data_frame
I like R for statistics. The variables in R are lexically scoped and dynamically typed.
I like C++ for just about everything else. C++ is strongly and statically typed: every object has a type and that type never changes.
I want to do some simple statistics in C++ but I can't imagine doing that without a heterogeneous data frame.
I want to be able to do what I do in R - desiderata:
But in C++ - ipsa:
It does this (unlike R, indexing in C++ begins from 0)
Here's how...
We need a C++ heterogeneously typed data frame - i.e. a table, a two-dimensional array-like structure, in which each column contains values of one variable and each row contains one set of values from each column - all indexed by column names and row numbers.
Since C++17 we can use std::variant. Rather than the traditional union that we might first think of using, the class template std::variant represents a type-safe union: it keeps track of which of its alternative types it currently holds.
So, we need a union of basic types similar to R.
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
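To make "type-safe" a bit more concrete, here is a minimal sketch of how such a variant reports which alternative it holds. The helper type_name is my own hypothetical name for illustration and is not part of the original code; only std::variant and its standard member functions are assumed.

```cpp
#include <ctime>
#include <string>
#include <variant>

namespace R {

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;

// Hypothetical helper (not from the post): name the alternative currently held.
// index() returns the zero-based position of the active type in the variant's
// template argument list, so the table below must follow the same order.
inline const char* type_name(const basic_data_types& v) {
    static constexpr const char* names[] = {
        "char", "double", "int", "std::string", "std::tm"};
    if (v.valueless_by_exception()) {
        return "valueless";
    }
    return names[v.index()];
}

} // namespace R
```

Assigning a value of a different type changes the active alternative, and asking for the wrong type with std::get throws std::bad_variant_access instead of silently reinterpreting the bytes as a plain union would.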
(I've chosen std::tm as a convenient data type for time and date but really we could do with a bespoke Date type, to mimic R, based on epoch that permits simple arithmetic.)
R uses vectors for data, and C++ has those too, but we need to mark semantically that ours is a heterogeneous vector of our basic types.
using variant_vector = std::vector<basic_data_types>;
Next, we need a table made up of columns of our variant_vector that has std::string column names and (zero-based, as is the C++ way) integer row numbers.
using data_frame = std::unordered_map<std::string, variant_vector>;
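Putting the three aliases together, a small frame could be built roughly like this. It is only a sketch: make_example and row_count are hypothetical helpers of mine, and the employee values are made-up sample data.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

namespace R {

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical helper: number of rows, asserting that every column has the
// same length (nothing in the aliases themselves enforces this).
inline std::size_t row_count(const data_frame& df) {
    if (df.empty()) {
        return 0;
    }
    const std::size_t n = df.begin()->second.size();
    for (const auto& entry : df) {
        assert(entry.second.size() == n && "columns must have equal length");
    }
    return n;
}

// Hypothetical sample data: three employees, one column per variable.
inline data_frame make_example() {
    data_frame d;
    d["emp_id"] = variant_vector{1, 2, 3};
    d["emp_name"] = variant_vector{std::string("Rick"), std::string("Dan"),
                                   std::string("Michelle")};
    d["salary"] = variant_vector{623.3, 515.2, 611.0};
    return d;
}

} // namespace R
```

Note that std::unordered_map does not remember insertion order, so the columns will not necessarily come back in the order they were added.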
To get the ball rolling for our desiderata let's also mimic the R as.Date function, ...
variant_vector as_dates(std::vector<std::string> dates);
... and the R print function - but of course using C++ idioms.
std::ostream& operator<<(std::ostream& os, const R::data_frame& df);
(You can see that I have popped everything into a namespace called "R".)
Although it is easy to build a data frame, it is slightly more convoluted to access data, and doing so requires prior knowledge of the column's data type.
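For illustration, the two declarations above could be implemented along these lines. This is a sketch under my own assumptions, not necessarily what the repo does: dates are taken to be ISO "YYYY-MM-DD" strings, unparseable input throws, the output is tab-separated, and print_cell is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

namespace R {

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Assumption: every input string has the form "YYYY-MM-DD".
inline variant_vector as_dates(std::vector<std::string> dates) {
    variant_vector result;
    result.reserve(dates.size());
    for (const auto& text : dates) {
        std::tm when{};  // zero-initialise every field before parsing
        std::istringstream in(text);
        in >> std::get_time(&when, "%Y-%m-%d");
        if (in.fail()) {
            throw std::invalid_argument("as_dates: cannot parse '" + text + "'");
        }
        result.emplace_back(when);
    }
    return result;
}

// Hypothetical helper: std::visit calls the lambda with whichever
// alternative the cell currently holds.
inline void print_cell(std::ostream& os, const basic_data_types& value) {
    std::visit([&os](const auto& held) {
        using T = std::decay_t<decltype(held)>;
        if constexpr (std::is_same_v<T, std::tm>) {
            os << std::put_time(&held, "%Y-%m-%d");
        } else {
            os << held;
        }
    }, value);
}

// Header line of column names, then one line per row prefixed by its
// zero-based row number. Column order follows the map's iteration order.
inline std::ostream& operator<<(std::ostream& os, const data_frame& df) {
    std::size_t rows = 0;
    for (const auto& [name, column] : df) {
        os << '\t' << name;
        rows = std::max(rows, column.size());
    }
    os << '\n';
    for (std::size_t r = 0; r < rows; ++r) {
        os << r;
        for (const auto& entry : df) {
            os << '\t';
            if (r < entry.second.size()) {
                print_cell(os, entry.second[r]);
            } else {
                os << "NA";  // short column: mark the missing cell
            }
        }
        os << '\n';
    }
    return os;
}

} // namespace R
```

One wrinkle: R::data_frame is only an alias for a std:: type, so argument-dependent lookup will not find this operator<< from outside the namespace. Code outside R needs something like "using R::operator<<;" before writing std::cout << df.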
auto money = std::get<double>(d["salary"][1]);
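The line above throws std::bad_variant_access if the cell is not a double, and operator[] on the map quietly inserts an empty column when the name is misspelt. A more defensive accessor might look like the sketch below; cell is a hypothetical helper of mine, not something from the original code.

```cpp
#include <cstddef>
#include <ctime>
#include <optional>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

namespace R {

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical helper: return the cell as a T, or std::nullopt when the
// column does not exist, the row is out of range, or the cell holds a
// different type. Never modifies the frame.
template <typename T>
std::optional<T> cell(const data_frame& df, const std::string& column,
                      std::size_t row) {
    const auto it = df.find(column);
    if (it == df.end() || row >= it->second.size()) {
        return std::nullopt;
    }
    // std::get_if returns nullptr instead of throwing on a type mismatch.
    if (const T* held = std::get_if<T>(&it->second[row])) {
        return *held;
    }
    return std::nullopt;
}

} // namespace R
```

With this, the lookup becomes something like auto money = R::cell<double>(d, "salary", 1); and the caller checks the optional before using the value.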
That's about it really, it works as expected and is useful.
This is an alternative approach to building a heterogeneous DataFrame in C++:
https://github.com/hosseinmoein/DataFrame
Thanks, Hossein, very interesting - mine is less about the data frame and more about using R-style idioms in C++ - I put it in a repo:
https://github.com/ifknot/data_frame