DataStreams.jl v0.1

It's been almost exactly a year since my original blog post about DataStreams.jl, so I figure it's time for an update!

There's actually quite a bit to announce, and I'm really excited about how things have developed over the last year.

What is DataStreams.jl again?

DataStreams.jl is about designing interfaces for easy and efficient transfer of "table-like" data (i.e. data that can, at least in some sense, be described by rows and columns) between sources and sinks. The key is to provide an interface (i.e. a set of required methods to implement) such that as long as a source implements it correctly, it can "stream" data to any existing, valid sink. Similarly, as long as a sink implements the required "sink interface" methods, it should then be able to "receive" data from any source. I gave an intro to the framework in my 2016 JuliaCon talk.

Traditionally, sources and sinks have been developed in isolation, with each source or sink having to provide its own options for interacting with other sources/sinks, usually only a single option (e.g. reading a csv file into a dataframe, or loading a csv file into a database). This is painstaking work, however, to support in each package independently. Imagine being a database middleware developer: do I really want to go through the effort of allowing users to load data into a DB via csv file, sqlite table, dataframe, etc.? A natural first step is to stop and say, "well, I can't possibly support all existing or future types of sources to load into a DB, but I could define a common interface, and as long as a source implements the interface, I can load it into the DB". All of a sudden, this dev relieves herself of having to be an expert in the most efficient handling of each specific data source type, opting instead to let source developers code to an interface.

This line of reasoning has been taken several steps further in the maturation of the DataStreams.jl framework. Independent interfaces exist for both Data.Source and Data.Sink types. DataStreams.jl itself provides a set of common Data.stream! methods that handle the actual initialization, streaming, and closedown of the data ingestion process, yet it has (and needs) no outside knowledge of the specific Source/Sink implementations. For a real-life example, the CSV.jl package now provides a CSV.Source type which implements the Data.Source interface and concerns itself solely with producing table-like data as efficiently as possible. CSV.Sink, on the other hand, implements the Data.Sink interface and only has to worry about storing tabular data to disk in the csv format. One of the most powerful results of this Source-Sink decoupling is that adding a new Source or Sink to the ecosystem is O(1): I don't have to worry about whether my new source type will be compatible with outputting data to 10 different sink types. I define how my source provides data, ensure the Data.Source interface is properly implemented, and boom, my new source can now interact with any existing and future sinks, automatically and efficiently.
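To make the decoupling concrete, here's a minimal sketch of streaming a csv file into a DataFrame by constructing the Source by hand and invoking the central machinery directly (the CSV.Source constructor arguments are illustrative, and I'm assuming Data.close! returns the finished sink; see the docs for the exact API):

```julia
using DataStreams, CSV, DataFrames

# A Data.Source: knows only how to produce table-like data from a csv file.
# (Constructor arguments shown here are illustrative; see the CSV.jl docs.)
source = CSV.Source("data.csv")

# Stream into a DataFrame sink. Data.stream! knows nothing about csv files
# or DataFrames specifically; it only talks to the two interfaces.
sink = Data.stream!(source, DataFrame)

# Finalize the sink (assumed to return the finished result).
df = Data.close!(sink)
```

The high-level convenience methods described below do exactly this on the user's behalf.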

For a more technical and in-depth reading of the interfaces for Data.Source or Data.Sink, see the docs here.

New DataStreams Features

In addition to the power of Source-Sink decoupling, having the actual streaming framework defined centrally in DataStreams.jl allows for several other useful features:

  • Appending: oftentimes, a user needs to aggregate several sources into a single sink; as part of the Data.Sink interface, sinks must account for the streaming scenario where new source data will be appended to existing sink data (with matching schema as an obvious requirement)
  • Transforms: often, users need to apply simple data cleansing or manipulation tasks to source data before it's in a usable form to analyze or output to a sink. With the data transfer defined centrally, it was extremely simple to allow the user to supply a set of transform functions to be applied to the data "in-transit", i.e. as a single field value or column of values is produced by the Data.Source, the transform function is applied before the result is sent to the Data.Sink for storage.
  • Package use consistency: because the Data.Source and Data.Sink interfaces are implemented consistently and separately, a natural "high-level" convenience method falls out of each package. Such methods for Sources are:
    • CSV.read(file, sink=DataFrame; append=false, transforms=Dict())
    • SQLite.query(db, query, sink=DataFrame; append=false, transforms=Dict())
    • Feather.read(file, sink=DataFrame; append=false, transforms=Dict())
    • ODBC.query(dsn, query, sink=DataFrame; append=false, transforms=Dict())

The definitions of these convenience methods are almost identical across packages, utilizing the same DataStreams.jl infrastructure, and all allowing the common DataStreams features: specifying any sink argument they'd like (typically a DataFrame by default), whether to append to the sink or not, and any transform functions that should be applied along the way. These convenience methods work by constructing Data.Source (and possibly Data.Sink) types "on-the-fly" before calling Data.stream!(source, sink) and Data.close!(sink) to perform the actual data transfer.
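To make that consistency concrete, here's a small, hedged sketch of the source-side methods in action (file names and the transform are made up, and exactly how transform functions receive values, per field vs. per column and nullable or not, is covered in the docs; the keyword names follow the signatures listed above):

```julia
using CSV, DataFrames

# Read a csv file into the default DataFrame sink, applying a transform
# "in-transit" to column 2 (column index => function; illustrative only).
df = CSV.read("sales_2016.csv"; transforms=Dict(2 => x -> x * 2))

# Append a second file with a matching schema to the sink we already have.
# (Passing an existing sink instance here is my assumption; the signatures
# above show the sink as an argument with DataFrame as the default.)
df = CSV.read("sales_2017.csv", df; append=true)
```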

Similar convenience methods for Data.Sink are:

  • CSV.write(file, source)
  • SQLite.load(db, tablename, source)
  • Feather.write(file, source)
    • ODBC.load(dsn, tablename, source)

with all supporting the same append and transforms keyword arguments as well.
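A similarly hedged sketch of the sink-side methods, writing a single in-memory table out in a few formats (file and table names are made up; a DataFrame works as the source argument because it, too, implements the Data.Source interface):

```julia
using CSV, Feather, SQLite, DataFrames

df = DataFrame(id = [1, 2, 3], name = ["a", "b", "c"])

CSV.write("output.csv", df)                              # csv file on disk
Feather.write("output.feather", df)                      # feather file on disk
SQLite.load(SQLite.DB("output.sqlite"), "mytable", df)   # table in an sqlite db
```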

What's next

Things on my mind that will be taking up my time include:

  • DataFrames modernization: there's some major work already underway to improve the core DataFrame type for handling tabular data in Julia
  • Multi-source/sink streaming: though the framework already supports appending when streaming, there are additional specific use-cases that could be supported by DataStreams centrally
  • More formal and powerful transforms: I feel like we've only scratched the surface for transforms so far; I don't think it'll be too difficult to support things like new-column transforms, multi-column transforms, and possibly aggregation transforms.
  • More Sources & Sinks!: I hope to help enable as many as possible, such as SAS, RData, certain geo-data formats, etc. Stream all the things!!

Finally, I definitely want to give shoutouts to those who have helped, with code, design, or otherwise: David Gold, John Myles White, Milan Bouchet-Valat, David Anthoff, and Yeesian Ng for their direct collaboration and help, as well as many of the core contributors to julia itself. And special shoutouts to Tom Breloff, Jiahao Chen, Cameron McBride, and Jeff Bezanson for specific discussions and suggestions at JuliaCon 2016 and since.

I'm also including some quick "new release news" for individual packages below, since there have been quite a few new features added:

  • CSV.jl

    • Docs
    • Supports a wide variety of delimited file options such as delim, quotechar, escapechar, custom null strings; a header can be provided manually or on a specified row or range of rows; types can be provided manually, and results can be requested as nullable or not (nullable=true by default); and the # of rows can be provided manually (if known) for efficiency.
    • CSV.parsefield(io::IO, ::Type{T}) can be called directly on any IO type to tap into the delimited-parsing functionality manually
  • SQLite.jl

    • Docs
    • Query results will now use the declared table column type by default, which can help resultset column typing in some cases
    • Parameterized SQL statements are fully supported, with the ability to bind julia values to be sent to the DB
    • Full serialization/deserialization of native and custom Julia types is supported; so Complex{Int128} can be stored in its own SQLite table column and retrieved without any issue
    • Pure Julia scalar and aggregation functions can be registered with an SQLite database and then called from within SQL statements: full docs here
  • Feather.jl
    • Docs
    • Full support for feather release v0.3.0 to ensure compatibility
    • Full support for returning "factor" or "category" type columns as native CategoricalArray and NullableCategoricalArray types in Julia, thanks to the new CategoricalArrays.jl package
    • nullable::Bool=true keyword argument; if false, columns without null values will be returned as Vector{T} instead of NullableVector{T}
    • Feather.Sink now supports appending, so multiple DataFrames or CSV.Source or any Data.Source can all be streamed to a single feather file
  • ODBC.jl
    • Docs
    • A new ODBC.DSN type that represents a valid, open connection to a database and is used in all subsequent API calls; it can be constructed using a previously configured system/user dsn with a username and password, or as a full custom connection string
    • Full support for the DataStreams.jl framework through the ODBC.Source and ODBC.Sink types, along with their high-level convenience methods ODBC.query and ODBC.load
    • A new ODBC.prepare(dsn, sql) => ODBC.Statement method which can send an SQL statement to the database to be compiled and planned before being executed one or more times. Prepared SQL statements can include parameters that have dynamic values bound before each execution.
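A hedged sketch of that prepared-statement workflow (the DSN details, table, and the ODBC.execute! call are assumptions on my part; check the ODBC.jl docs for the exact execution method):

```julia
using ODBC

# Connect via a previously configured system/user dsn.
dsn = ODBC.DSN("mydsn", "username", "password")

# Compile/plan the statement once on the database side...
stmt = ODBC.prepare(dsn, "INSERT INTO widgets (id, name) VALUES (?, ?)")

# ...then bind fresh values and execute it as many times as needed.
# (ODBC.execute! is my assumption for the execution call; see the docs.)
for (id, name) in [(1, "foo"), (2, "bar")]
    ODBC.execute!(stmt, [id, name])
end
```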