Organise your data the R way
NEW! More stuff, now in a GitHub repo: https://github.com/ifknot/data_frame
I like R for statistics. The variables in R are lexically scoped and dynamically typed.
I like C++ for just about everything else. C++ is a strongly typed language and it is also statically-typed; every object has a type and that type never changes.
I want to do some simple statistics in C++ but I can't imagine doing that without a heterogeneous Data Frame.
I want to be able to do what I do in R - desiderata:
# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
But in C++ - ipsa:
#include <iostream>
#include "r_data_frame.h"
int main() {
    std::cout << "heterogeneous container\n\n";
    R::data_frame d;
    d["id"] = { 1, 2, 3, 4, 5 };
    d["name"] = { "Rick", "Dan", "Michelle", "Ryan", "Gary" };
    d["salary"] = { 623.3, 515.2, 611.0, 729.0, 843.25 };
    d["start_date"] = R::as_dates({ "2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27" });
    // print out data table
    std::cout << d << '\n';
    // accessing data does need prior knowledge of the column data type
    auto money = std::get<double>(d["salary"][1]);
    // but C++ is strongly typed so there we go
    std::cout << std::get<std::string>(d["name"][1]) << " earns $" << money << "\n\n";
    std::cout << d["name"] << '\n';
}
It does this (unlike in R, indexing in C++ begins from 0):
heterogeneous container
	id	name	salary	start_date
0	1	Rick	623.3	2012-01-01
1	2	Dan	515.2	2013-09-23
2	3	Michelle	611	2014-11-15
3	4	Ryan	729	2014-05-11
4	5	Gary	843.25	2015-03-27
Dan earns $515.2
Rick	Dan	Michelle	Ryan	Gary
Here's how...
We need a C++ heterogeneously typed data frame, i.e. a table: a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column, all indexed by column names and row numbers.
Since C++17 we can use std::variant. Rather than the traditional union that we might first think of using, the class template std::variant represents a type-safe union: it keeps track of which of its types it currently holds.
So, we need a union of basic types similar to R.
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
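To make "type-safe" a bit more concrete, here is a minimal sketch of how such a cell behaves. The helpers type_name, wrong_type and variant_example are names I made up for illustration; they are not part of the data frame code in this post.

```cpp
#include <cassert>
#include <ctime>
#include <string>
#include <variant>

// Same alias as in the post.
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;

// Hypothetical helper (not in the post's code): names the type a cell holds.
inline std::string type_name(const basic_data_types& v) {
    if (std::holds_alternative<char>(v))        return "char";
    if (std::holds_alternative<double>(v))      return "double";
    if (std::holds_alternative<int>(v))         return "int";
    if (std::holds_alternative<std::string>(v)) return "string";
    return "tm";
}

// Hypothetical helper: true when reading the cell as T would throw.
template <typename T>
bool wrong_type(const basic_data_types& v) {
    try {
        (void)std::get<T>(v);   // throws std::bad_variant_access on a mismatch
        return false;
    } catch (const std::bad_variant_access&) {
        return true;
    }
}

// Call from main() to check the behaviour described above.
inline void variant_example() {
    basic_data_types cell = 623.3;      // now holds a double
    assert(type_name(cell) == "double");
    assert(wrong_type<int>(cell));      // asking for the wrong type is caught
    cell = std::string("Rick");         // the held type can change at run time
    assert(type_name(cell) == "string");
}
```

So each cell is dynamically typed at run time, a little like an R value, while every access is still checked.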
(I've chosen std::tm as a convenient data type for time and date but really we could do with a bespoke Date type, to mimic R, based on epoch that permits simple arithmetic.)
R uses vectors for data and C++ has those, but we need to semantically mark that it is a heterogeneous vector of our basic types.
using variant_vector = std::vector<basic_data_types>;
Next, we need a table made up of columns of our variant_vector that has std::string column names and (zero-based, as is the C++ way) integer row numbers.
using data_frame = std::unordered_map<std::string, variant_vector>;
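One thing worth knowing about this choice: the iteration order of a std::unordered_map is unspecified, so columns are not guaranteed to come back in the order they were added. If that matters, something like the following sketch could remember the order. The ordered_frame wrapper and ordered_frame_example are my own hypothetical names and are not in the repo.

```cpp
#include <cassert>
#include <ctime>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Same aliases as in the post.
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical wrapper: keeps the columns in a data_frame and records the
// order in which column names were first used.
struct ordered_frame {
    data_frame columns;
    std::vector<std::string> names;

    variant_vector& operator[](const std::string& name) {
        if (columns.find(name) == columns.end()) {
            names.push_back(name);   // first time we see this column
        }
        return columns[name];        // creates an empty column if needed
    }
};

// Call from main() to check that insertion order is kept.
inline void ordered_frame_example() {
    ordered_frame d;
    d["id"] = { 1, 2, 3 };
    d["name"] = { "Rick", "Dan", "Michelle" };
    d["id"][0] = 9;                  // reusing a name does not add it again
    assert(d.names.size() == 2);
    assert(d.names[0] == "id" && d.names[1] == "name");
}
```

Walking d.names and looking each name up in d.columns then visits the columns in insertion order.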
To get the ball rolling for our desiderata let's also mimic the R as.Date function, ...
variant_vector as_dates(std::vector<std::string> dates);
... and the R print function - but of course using C++ idioms.
std::ostream& operator<<(std::ostream& os, const R::data_frame& df);
(You can see that I have popped everything into a namespace called "R".)
Although it is easy to build a data frame, it is slightly more convoluted to access data, and doing so requires prior knowledge of the column's data type.
auto money = std::get<double>(d["salary"][1]);
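If you would rather not have std::get throw when you guess the type wrong, std::get_if and std::visit offer a gentler route. This is only a sketch: cell_as, as_number and access_example are hypothetical helpers of my own, not functions from the repo.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <optional>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

// Same aliases as in the post.
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical helper: the cell as T, or std::nullopt when the column is
// missing, the row is out of range, or the cell holds a different type.
template <typename T>
std::optional<T> cell_as(const data_frame& df, const std::string& column, std::size_t row) {
    const auto it = df.find(column);
    if (it == df.end() || row >= it->second.size()) return std::nullopt;
    if (const T* p = std::get_if<T>(&it->second[row])) return *p;
    return std::nullopt;
}

// Hypothetical helper: reads an int or double cell as a double without
// the caller knowing which of the two it is.
inline std::optional<double> as_number(const basic_data_types& v) {
    return std::visit([](const auto& x) -> std::optional<double> {
        using T = std::decay_t<decltype(x)>;
        if constexpr (std::is_same_v<T, int> || std::is_same_v<T, double>) {
            return static_cast<double>(x);
        } else {
            return std::nullopt;
        }
    }, v);
}

// Call from main() to check the behaviour.
inline void access_example() {
    data_frame d;
    d["salary"] = { 623.3, 515.2 };
    d["id"] = { 1, 2 };
    assert(cell_as<double>(d, "salary", 1).has_value());
    assert(!cell_as<int>(d, "salary", 1));      // wrong type: no exception
    assert(as_number(d["id"][0]).has_value());  // int read as a number
}
```

The trade-off is that the caller now has to check the optional, but a wrong guess no longer ends in std::bad_variant_access.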
That's about it really, it works as expected and is useful.
TODO:
Code:
#pragma once
#include <ctime> // std::tm
#include <variant>
#include <string>
#include <vector>
#include <unordered_map>
#include <iostream>
#include <iomanip>
#include <sstream>
namespace R {
    using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
    using variant_vector = std::vector<basic_data_types>;
    /**
     * As per R language definition the following are the characteristics of a data frame
     *
     * - The column names should be non-empty.
     * - The row names should be unique.
     * - The data stored in a data frame can be of numeric, factor or character type.
     * - Each column should contain same number of data items.
     */
    using data_frame = std::unordered_map<std::string, variant_vector>;
    variant_vector as_dates(std::vector<std::string> dates);
}
std::ostream& operator<<(std::ostream& os, const std::tm& tm);
std::ostream& operator<<(std::ostream& os, const R::variant_vector& vv);
std::ostream& operator<<(std::ostream& os, const R::data_frame& df);
#include "r_data_frame.h"
namespace R {
    variant_vector as_dates(std::vector<std::string> source) {
        R::variant_vector tm_dates;
        for (const auto& date : source) {
            std::stringstream ss;
            ss << date + "T00:00:00Z ";
            std::tm tm{}; // zero-initialise so fields get_time does not set are defined
            ss >> std::get_time(&tm, "%Y-%m-%dT%H:%M:%S");
            tm_dates.push_back(tm);
        }
        return tm_dates;
    }
}
std::ostream& operator<<(std::ostream& os, const std::tm& tm) {
    os << std::put_time(&tm, "%Y-%m-%d");
    return os;
}
std::ostream& operator<<(std::ostream& os, const R::variant_vector& vv) {
    for (const auto& v : vv) {
        std::visit([&os](auto&& arg) { os << arg << '\t'; }, v);
    }
    return os;
}
std::ostream& operator<<(std::ostream& os, const R::data_frame& df) {
    size_t sz{ 0 }; // stays 0 for an empty data frame
    // header row of column names
    for (const auto& [key, vctr] : df) {
        os << '\t' << key;
        sz = vctr.size(); // TODO: check each column contains same number of data items
    }
    os << '\n'; // write to the stream we were given, not std::cout
    // one line per row, prefixed by the zero-based row number
    for (size_t i{ 0 }; i < sz; ++i) {
        os << i;
        for (const auto& [key, vctr] : df) {
            std::visit([&os](auto&& arg) { os << '\t' << arg; }, vctr[i]);
        }
        os << '\n';
    }
    return os;
}
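The TODO in the data frame operator<< (checking that each column contains the same number of data items) could be handled along these lines. This is a sketch under my own assumptions: row_count and row_count_example are hypothetical names and are not in the repo.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <optional>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Same aliases as in the post.
using basic_data_types = std::variant<char, double, int, std::string, std::tm>;
using variant_vector = std::vector<basic_data_types>;
using data_frame = std::unordered_map<std::string, variant_vector>;

// Hypothetical helper: the number of rows shared by every column, 0 for an
// empty frame, or std::nullopt when the columns differ in length.
inline std::optional<std::size_t> row_count(const data_frame& df) {
    std::optional<std::size_t> rows;
    for (const auto& [name, column] : df) {
        if (!rows) {
            rows = column.size();          // first column sets the expectation
        } else if (*rows != column.size()) {
            return std::nullopt;           // ragged frame
        }
    }
    return rows.value_or(0);
}

// Call from main() to check the behaviour.
inline void row_count_example() {
    data_frame d;
    d["id"] = { 1, 2, 3 };
    d["name"] = { "Rick", "Dan", "Michelle" };
    assert(row_count(d).has_value());      // both columns have 3 rows
    d["salary"] = { 623.3 };               // now one column is shorter
    assert(!row_count(d).has_value());
}
```

The printing code could call this first and refuse to print (or print an error) when it returns std::nullopt, instead of indexing past the end of a short column.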
This is an alternative approach to building a heterogeneous DataFrame in C++:
https://github.com/hosseinmoein/DataFrame
Thanks Hossein, very interesting. Mine is less about the data frame and more about using R-style idioms in C++. I put it in a repo:
https://github.com/ifknot/data_frame