Ingest some CSV
Last month I presented an R-ish data_frame class as a small side project; this month I present a C++ equivalent of the R read.csv() function to import data into a data_frame.
Ongoing development is in the GitHub repo: https://github.com/ifknot/rpp
To recap, the motivation for the C++ heterogeneous data_frame class was two-fold:
- Runtime handling of dumb data whose format, types, and fields are unknown.
- Enable data science skill transfer from the R functional programming environment into C++.
The motivation for the read_csv() function remains the same as that for the heterogeneous data_frame class:
"I want to be able to do the same sort of thing that I do in R, but in C++".
This time around, that means being able to use one of the easiest and most reliable ways of getting data in: text files.
In particular, CSV (comma-separated values) files. The CSV format uses commas to separate the fields within a record and puts each record on its own line of the text file, which makes CSV files ideal for representing tabular data, i.e. the contents of a data_frame.
R comes with a healthy supply of inbuilt data sets to practice with, so, for development purposes, I have borrowed the mpg.csv dataset of miles per gallon (mpg) performance data for a range of makes and models of cars.
I want to be able to do what I do in R - desiderata:
But in C++ - ipsa:
It does this (unlike in R, indexing in C++ begins from 0)
Unfortunately, not only is the tabulation not smart, the date is not read correctly either! This would be solved in R by converting the column using the R as.Date() function. Therefore, a C++ equivalent of the as.Date() function will also be required.
Parsing the CSV input file
read_csv()
The R read.csv() function is a wrapper function for read.table() that mandates a comma as the separator and uses the input file's first line as a header that specifies the table's column names. However, until I develop this further, the C++ read_csv() function is implemented as a function in its own right rather than as a wrapper.
1. Tokenization
At this stage the tokenizer is not, yet, equipped to handle dates or complex numbers:
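A minimal sketch of a tokenizer with that scope, offered as an illustration rather than the rpp code: the names token_kind, token, classify and tokenize_line are hypothetical. It assumes one record per line, double quotes with no escaped quotes inside them, and only three kinds of field (integer, real, text).

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical token kinds; dates and complex numbers are deliberately absent.
enum class token_kind { integer, real, text };

struct token {
    token_kind kind;
    std::string lexeme;  // the raw field text, quotes removed
};

// A field counts as numeric only if the C library parser consumes all of it.
inline token_kind classify(const std::string& field) {
    if (field.empty()) return token_kind::text;
    char* end = nullptr;
    (void)std::strtol(field.c_str(), &end, 10);
    if (*end == '\0') return token_kind::integer;
    (void)std::strtod(field.c_str(), &end);
    if (*end == '\0') return token_kind::real;
    return token_kind::text;
}

// Split one CSV line on commas, treating commas inside "..." as field content.
inline std::vector<token> tokenize_line(const std::string& line) {
    std::vector<token> tokens;
    std::string field;
    bool quoted = false;      // currently inside a quoted field
    bool was_quoted = false;  // a quoted field always stays text
    auto flush = [&] {
        tokens.push_back({was_quoted ? token_kind::text : classify(field), field});
        field.clear();
        was_quoted = false;
    };
    for (char c : line) {
        if (c == '"') { quoted = !quoted; was_quoted = true; }
        else if (c == ',' && !quoted) flush();
        else field += c;
    }
    flush();  // the last field has no trailing comma
    return tokens;
}
```

Because classification leans on strtol and strtod, anything those functions accept in full (for example "1e5" or "nan") is treated as a number; a stricter tokenizer would need its own grammar.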
2. Evaluator
The evaluator stage can be built into the read_csv() function body as a simple if and switch filter:
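As a rough sketch of what an if and switch filter of this kind can look like, assuming tokens that carry a kind tag plus the raw text: the table type below is a stand-in built from std::variant, not the rpp data_frame, and append_row is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical token shape, as a tokenizer stage might produce it.
enum class token_kind { integer, real, text };
struct token {
    token_kind kind;
    std::string lexeme;
};

// Stand-in for a heterogeneous table: one inner vector per column.
using cell = std::variant<long, double, std::string>;
using table = std::vector<std::vector<cell>>;

// Evaluate one tokenized row and append its values column by column.
inline void append_row(table& columns, const std::vector<token>& row) {
    if (columns.empty()) columns.resize(row.size());  // first row fixes the width
    if (row.size() != columns.size()) return;         // the "if": drop ragged rows
    for (std::size_t i = 0; i < row.size(); ++i) {
        switch (row[i].kind) {                        // the "switch": convert by kind
        case token_kind::integer:
            columns[i].emplace_back(std::stol(row[i].lexeme));
            break;
        case token_kind::real:
            columns[i].emplace_back(std::stod(row[i].lexeme));
            break;
        case token_kind::text:
            columns[i].emplace_back(row[i].lexeme);
            break;
        }
    }
}
```

Silently dropping ragged rows is a choice made for this sketch only; reporting them, or padding with missing values as R does, would be equally defensible. Note that std::stol and std::stod throw on out-of-range input.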
As yet, the read_csv() function does not implement the full functionality of the R read.csv() function, i.e. column names and classes.
Printing the first part of a data_frame
head()
The C++ version of the R head() function is straightforward:
Converting a column of integers or strings into an R-ish date type
Conversion to dates will require a specific date type. However, unlike R, where dates are represented as the number of days since 1970-01-01, with negative values for earlier dates, I have chosen std::tm, which is a more intuitive structure holding a calendar date and time broken down into its components.
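To make the contrast between the two representations concrete, here is a small sketch that collapses the calendar fields of a std::tm into an R-style day count. The function names are hypothetical and not part of rpp; the arithmetic is the days_from_civil algorithm published by Howard Hinnant, and it assumes the proleptic Gregorian calendar and ignores the time-of-day fields.

```cpp
#include <ctime>

// Days from 1970-01-01 to the given date (negative for earlier dates).
constexpr long days_from_civil(int year, unsigned month, unsigned day) {
    year -= month <= 2;  // count January and February as part of the previous year
    const long era = (year >= 0 ? year : year - 399) / 400;  // 400-year cycle
    const unsigned year_of_era = static_cast<unsigned>(year - era * 400);
    // Day within a year that starts on 1 March, so the leap day falls last.
    const unsigned day_of_year =
        (153 * (month > 2 ? month - 3 : month + 9) + 2) / 5 + day - 1;
    const unsigned day_of_era =
        year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    return era * 146097 + static_cast<long>(day_of_era) - 719468;
}

// std::tm stores years since 1900 and months 0-11, so shift both.
inline long days_since_epoch(const std::tm& when) {
    return days_from_civil(when.tm_year + 1900,
                           static_cast<unsigned>(when.tm_mon + 1),
                           static_cast<unsigned>(when.tm_mday));
}
```

The std::tm form keeps year, month and day directly readable, while the day count makes date differences a single subtraction.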
r_date
as_dates()
With a date type in hand the C++ version of the R as.Date() function becomes:
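A hedged sketch of the string-to-date half of such a conversion, using std::get_time from the standard library. The names parse_date and as_dates_sketch are hypothetical, the default format "%Y-%m-%d" is an assumption made for this example, and integer columns are not handled.

```cpp
#include <ctime>
#include <iomanip>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Parse one field into a std::tm, or return nullopt if it does not match.
inline std::optional<std::tm> parse_date(const std::string& text,
                                         const char* format = "%Y-%m-%d") {
    std::tm when{};  // zero-initialise so unparsed fields are well defined
    std::istringstream in(text);
    in >> std::get_time(&when, format);
    if (in.fail()) return std::nullopt;
    return when;
}

// Convert a whole column; entries that fail to parse become empty optionals,
// loosely analogous to NA in R.
inline std::vector<std::optional<std::tm>> as_dates_sketch(
        const std::vector<std::string>& column,
        const char* format = "%Y-%m-%d") {
    std::vector<std::optional<std::tm>> dates;
    dates.reserve(column.size());
    for (const auto& text : column) dates.push_back(parse_date(text, format));
    return dates;
}
```

Returning optionals keeps a bad field from aborting the whole column. Trailing characters after a valid date are not rejected by this sketch.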
Example Usage
Bringing it all together: