Saturday, 5 September 2020

A heterogeneous data frame in C++

 

Organise your data the R way

NEW! More stuff, now in a GitHub repo: https://github.com/ifknot/data_frame

I like R for statistics. The variables in R are lexically scoped and dynamically typed. 

I like C++ for just about everything else. C++ is a strongly typed language and it is also statically-typed; every object has a type and that type never changes.

I want to do some simple statistics in C++ but I can't imagine doing that without a heterogeneous data frame.

I want to be able to do what I do in R - desiderata:

# Create the data frame.
emp.data <- data.frame(
    emp_id = c(1:5),
    emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
    salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
    start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                           "2015-03-27")),
    stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)

But in C++ - ipsa:

#include <iostream>
#include "r_data_frame.h"

int main() {
    std::cout << "heterogeneous container\n\n";
    R::data_frame d;
    d["id"] = { 1, 2, 3, 4, 5 };
    d["name"] = { "Rick", "Dan", "Michelle", "Ryan", "Gary" };
    d["salary"] = { 623.3, 515.2, 611.0, 729.0, 843.25 };
    d["start_date"] = R::as_dates({ "2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27" });
    // print out data table
    std::cout << d << '\n';
    // accessing data does need prior knowledge of the column data type
    auto money = std::get<double>(d["salary"][1]);
    // but C++ is strongly typed so there we go
    std::cout << std::get<std::string>(d["name"][1]) << " earns $" << money << "\n\n";
    std::cout << d["name"] << '\n';
}

It does this (unlike in R, indexing in C++ begins at 0):

heterogeneous container

    id  name      salary  start_date
0   1   Rick      623.3   2012-01-01
1   2   Dan       515.2   2013-09-23
2   3   Michelle  611     2014-11-15
3   4   Ryan      729     2014-05-11
4   5   Gary      843.25  2015-03-27

Dan earns $515.2

Rick  Dan  Michelle  Ryan  Gary

Here's how...

We need a heterogeneously typed data frame in C++ - i.e. a table, a two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one value from each column - all indexed by column name and row number.

Since C++17 we can use std::variant. A plain union of types is what we might first think of using, but the class template std::variant goes further: it represents a type-safe union that keeps track of which alternative it currently holds.
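To make "type-safe" concrete, here is a minimal sketch. The alias number_or_text and the two helper functions are made up for this illustration and are not part of the data frame code; only std::variant, std::holds_alternative, std::get and std::bad_variant_access come from the standard library.

```cpp
#include <string>
#include <variant>

// Illustrative alias only: a union of three alternatives.
using number_or_text = std::variant<int, double, std::string>;

// Report which alternative the variant currently holds.
std::string describe(const number_or_text& v) {
    if (std::holds_alternative<int>(v)) return "int";
    if (std::holds_alternative<double>(v)) return "double";
    return "string";
}

// Asking for std::string when another alternative is held throws
// std::bad_variant_access instead of silently reinterpreting the bytes,
// which is what a raw union would let you do.
bool string_access_throws(const number_or_text& v) {
    try {
        (void)std::get<std::string>(v);
    } catch (const std::bad_variant_access&) {
        return true;
    }
    return false;
}
```

So a variant holding 42 reports "int", hands the value back through std::get<int>, and refuses std::get<std::string>.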

So, we need a union of basic types similar to R.

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;

(I've chosen std::tm as a convenient data type for time and date but really we could do with a bespoke Date type, to mimic R, based on epoch that permits simple arithmetic.)
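For what it's worth, such an epoch-based type could look roughly like the sketch below. Date, make_date and the operators are hypothetical names, not something in the repo; the day count uses the widely published days_from_civil arithmetic (due to Howard Hinnant) for the proleptic Gregorian calendar.

```cpp
// Hypothetical date type: whole days since 1970-01-01.
struct Date {
    int days_since_epoch;
};

// Convert a civil date (year, month 1-12, day 1-31) to days since 1970-01-01.
constexpr int days_from_civil(int y, unsigned m, unsigned d) {
    y -= m <= 2;  // treat January and February as the end of the previous year
    const int era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);            // year of era, [0, 399]
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;  // day of year, [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // day of era, [0, 146096]
    return era * 146097 + static_cast<int>(doe) - 719468;
}

constexpr Date make_date(int y, unsigned m, unsigned d) {
    return Date{ days_from_civil(y, m, d) };
}

// Simple arithmetic, in the spirit of R's Date class.
constexpr int operator-(Date a, Date b) { return a.days_since_epoch - b.days_since_epoch; }
constexpr Date operator+(Date a, int days) { return Date{ a.days_since_epoch + days }; }
constexpr bool operator==(Date a, Date b) { return a.days_since_epoch == b.days_since_epoch; }
constexpr bool operator<(Date a, Date b) { return a.days_since_epoch < b.days_since_epoch; }
```

With this, make_date(2013, 9, 23) - make_date(2012, 1, 1) is a plain int number of days, which std::tm cannot offer without a round trip through std::mktime.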

R uses vectors for data, and C++ has those too, but we need an alias that marks ours as a heterogeneous vector of our basic types.

using variant_vector = std::vector<basic_data_types>;

Next, we need a table made up of columns of our variant_vector that has std::string column names and (zero based as is the C++ way) integer row numbers.

using data_frame = std::unordered_map<std::string, variant_vector>;
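One thing the alias alone does not enforce is R's rule that every column holds the same number of items. A minimal sketch of a check follows; is_rectangular is a hypothetical helper name, not part of the code in this post. Note also that std::unordered_map does not keep insertion order, so columns may not print in the order they were added.

```cpp
#include <cstddef>
#include <ctime>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical helper: true when all columns have the same number of rows.
// An empty data frame counts as rectangular.
bool is_rectangular(const data_frame& df) {
    if (df.empty()) return true;
    const std::size_t rows = df.begin()->second.size();
    for (const auto& [name, column] : df) {
        if (column.size() != rows) return false;
    }
    return true;
}
```

A check like this could run before printing, so that a short column is reported rather than indexed past its end.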

To get the ball rolling for our desiderata let's also mimic the R as.Date function, ...

variant_vector as_dates(std::vector<std::string> dates);

... and the R print function - but of course using C++ idioms.

std::ostream& operator<<(std::ostream& os, const R::data_frame& df);

(you can see that I have popped everything into a namespace called "R")
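As an aside on the date conversion: std::get_time sets the stream's fail bit when the text does not match the format, so a stricter variant can report bad input. The sketch below assumes a hypothetical helper called parse_date and accepts only the "YYYY-MM-DD" form; it is not the as_dates function from this post.

```cpp
#include <ctime>
#include <iomanip>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: parse "YYYY-MM-DD" into a std::tm,
// or return std::nullopt when the text does not match.
std::optional<std::tm> parse_date(const std::string& text) {
    std::tm tm{};  // value-initialise so the fields we do not parse are zero
    std::istringstream ss(text);
    ss >> std::get_time(&tm, "%Y-%m-%d");
    if (ss.fail()) return std::nullopt;
    return tm;
}
```

Remember that std::tm stores years since 1900 and months from 0, so "2013-09-23" comes back with tm_year 113 and tm_mon 8.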

Although it is easy to build a data frame, accessing the data is slightly more convoluted and requires prior knowledge of the column's data type.

auto money = std::get<double>(d["salary"][1]);
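When the type is not known in advance, std::get_if and std::visit offer a way in that does not throw. This is only a sketch: as_double and sum_numeric are hypothetical helper names, and the aliases are repeated here so the snippet stands alone.

```cpp
#include <ctime>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;

// Hypothetical helper: pointer to the double held in a cell,
// or nullptr if the cell holds some other type.
const double* as_double(const basic_data_types& cell) {
    return std::get_if<double>(&cell);
}

// Hypothetical helper: add up the int and double cells of a column,
// skipping chars, strings and dates.
double sum_numeric(const variant_vector& column) {
    double total = 0.0;
    for (const auto& cell : column) {
        std::visit([&total](const auto& value) {
            using T = std::decay_t<decltype(value)>;
            if constexpr (std::is_same_v<T, int> || std::is_same_v<T, double>) {
                total += value;
            }
        }, cell);
    }
    return total;
}
```

The trade-off is that std::get fails loudly on a wrong guess, while these helpers quietly return nullptr or skip the cell, so which one suits depends on how much you trust the column.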

That's about it really, it works as expected and is useful.

TODO: 

It needs a csv file reader to load tables in dynamically, but I will save that for another day...
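In the meantime, a very rough sketch of what such a reader might look like is given below. Everything in it is an assumption made for illustration: the names cell, frame, split_csv_line, parse_cell and read_csv are hypothetical, the first line is taken as the column names, fields are split on plain commas with no quoting support, and each cell's type is guessed independently (int, then double, else string), so a column can end up with mixed types.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

using cell = std::variant<double, int, std::string>;
using frame = std::unordered_map<std::string, std::vector<cell>>;

// Split one line on commas (no quoting or escaping; a trailing empty field is dropped).
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    return fields;
}

// Guess a cell type: int if the whole text parses as an integer,
// double if it parses as a floating point number, otherwise string.
// Out-of-range integers are not handled in this sketch.
cell parse_cell(const std::string& text) {
    if (text.empty()) return text;
    char* end = nullptr;
    const long as_long = std::strtol(text.c_str(), &end, 10);
    if (*end == '\0') return static_cast<int>(as_long);
    const double as_double = std::strtod(text.c_str(), &end);
    if (*end == '\0') return as_double;
    return text;
}

// Read a header line of column names, then one row per line.
frame read_csv(std::istream& in) {
    frame df;
    std::string line;
    if (!std::getline(in, line)) return df;
    const auto names = split_csv_line(line);
    while (std::getline(in, line)) {
        const auto fields = split_csv_line(line);
        for (std::size_t i = 0; i < names.size() && i < fields.size(); ++i) {
            df[names[i]].push_back(parse_cell(fields[i]));
        }
    }
    return df;
}
```

Because read_csv takes a std::istream, it can be fed a std::ifstream for a real file or a std::istringstream for a quick test.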


Code:

#pragma once
#include <ctime>    // std::tm
#include <variant>
#include <string>
#include <vector>
#include <unordered_map>
#include <iostream>
#include <iomanip>
#include <sstream>

namespace R {

    // the basic types a single cell may hold
    using basic_data_types = std::variant<char, double, int, std::string, std::tm>;

    // one column: a vector of heterogeneous cells
    using variant_vector = std::vector<basic_data_types>;

    /**
    * As per R language definition the following are the characteristics of a data frame
    *
    * - The column names should be non-empty.
    * - The row names should be unique.
    * - The data stored in a data frame can be of numeric, factor or character type.
    * - Each column should contain same number of data items.
    */
    using data_frame = std::unordered_map<std::string, variant_vector>;

    // convert "YYYY-MM-DD" strings into a column of std::tm values
    variant_vector as_dates(std::vector<std::string> dates);

}

std::ostream& operator<<(std::ostream& os, const std::tm& tm);
std::ostream& operator<<(std::ostream& os, const R::variant_vector& vv);
std::ostream& operator<<(std::ostream& os, const R::data_frame& df);

#include "r_data_frame.h"

namespace R {

    variant_vector as_dates(std::vector<std::string> source) {
        R::variant_vector tm_dates;
        for (const auto& date : source) {
            std::stringstream ss;
            ss << date + "T00:00:00Z ";
            std::tm tm{};  // zero-initialise so fields get_time does not set are defined
            ss >> std::get_time(&tm, "%Y-%m-%dT%H:%M:%S");
            tm_dates.push_back(tm);
        }
        return tm_dates;
    }

}

// print a date as YYYY-MM-DD
std::ostream& operator<<(std::ostream& os, const std::tm& tm) {
    os << std::put_time(&tm, "%Y-%m-%d");
    return os;
}

// print one column as a tab separated row
std::ostream& operator<<(std::ostream& os, const R::variant_vector& vv) {
    for (const auto& v : vv) {
        std::visit([&os](auto&& arg) { os << arg << '\t'; }, v);
    }
    return os;
}

// print the whole table: a header of column names, then one numbered line per row
std::ostream& operator<<(std::ostream& os, const R::data_frame& df) {
    std::size_t sz = 0;  // stays 0 for an empty data frame
    for (const auto& [key, vctr] : df) {
        os << '\t' << key;
        sz = vctr.size(); // TODO: check each column contains same number of data items
    }
    os << '\n';  // write to os, not std::cout, so any stream can be the target
    for (std::size_t i{ 0 }; i < sz; ++i) {
        os << i;
        for (const auto& [key, vctr] : df) {
            std::visit([&os](auto&& arg) { os << '\t' << arg; }, vctr[i]);
        }
        os << '\n';
    }
    return os;
}



2 comments:

  1. This is an alternative approach to building a heterogeneous DataFrame in C++:
    https://github.com/hosseinmoein/DataFrame
  2. Thanks Hossein, very interesting - mine is less about the data frame and more about using R-style idioms in C++ - I put it in a repo:

    https://github.com/ifknot/data_frame