Aggregation

Note

The following applies currently only to the DataFrame class. Aggregation with a ListOfDicts is simpler and covered by the API-documentation on ListOfDicts.aggregate().

By aggregation, we refer to splitting a data frame into groups based on the values of one or more columns and then calculating group-wise summaries, such total count or mean of a column. The first step is called group_by and the second aggregate, usually written via method chaining as data.group_by(...).aggregate(...).

A simple example below of how to calculate the total count and mean price of AirBnb listings in New York grouped by neighbourhood. The aggregate method takes keyword arguments of the function to be used to calculate the summary and the name of the column for that summary in the output. The return value is a regular data frame. See the following sections for what kinds of aggregation functions you can use.

>>> import dataiter as di
>>> data = di.read_csv("data/listings.csv")
>>> data.group_by("hood").aggregate(n=di.count(), price=di.mean("price"))
.
           hood     n   price
           <U13 int64 float64
  ───────────── ───── ───────
0         Bronx  1198  90.176
1      Brooklyn 19931 125.056
2     Manhattan 21963 218.855
3        Queens  6068  99.745
4 Staten Island   370 116.908

Common Aggregation Functions

Dataiter includes ready functions for the most common summaries that you might want to calculate. These are technically function factories, i.e. they are functions that return functions, that will then be called group-wise within the aggregate method. For example, di.mean("price") returns a function, that given a data frame, returns the group-wise mean of the “price” column. The supported functions are listed below.

These common aggregation functions are provided for two reasons: (1) they provide shorter, more convenient syntax than typing out lambda functions and (2) they allow a huge conditional speed up under the hood. The relevant caveat here is that they work only for single column calculations. If you need to use multiple columns, such as for calculating a weighted mean, see the next section on using arbitrary lambda functions. And see the last section on when and how you can benefit from the huge speed ups that these functions provide.

Arbitrary Aggregation

If you need to access multiple columns in aggregation or you need to calculate some more esoteric summaries than what you can accomplish with the above, then you’ll need to use custom lambda functions. These functions should take a data frame as an argument and return a scalar value. The aggregate method will then apply your lambda functions group-wise.

Repeating the example up top, below is how you’d do the same with lambda functions. Notice that the code needed is a bit more verbose and if you try this with a data frame that has a large amount of groups (around 100,000 or more), you’ll notice that it gets a bit slow, but for more common sizes of input, it should be well usable.

>>> import dataiter as di
>>> data = di.read_csv("data/listings.csv")
>>> data.group_by("hood").aggregate(n=lambda x: x.nrow, price=lambda x: x.price.mean())
.
           hood     n   price
           <U13 int64 float64
  ───────────── ───── ───────
0         Bronx  1198  90.176
1      Brooklyn 19931 125.056
2     Manhattan 21963 218.855
3        Queens  6068  99.745
4 Staten Island   370 116.908

Going Fast with Numba

The common aggregation functions listed above are implemented in Dataiter as both pure Python code (slow) and JIT-compiled Numba code (fast). If you have Numba installed and importing it succeeds, then Dataiter will automatically use it for aggregation involving boolean, integer, float, date, and datetime columns. If Numba is not available, Dataiter will automatically fall back on the slower pure Python implementations. The result should be the same, whether Numba is used or not, excluding some minor rounding or float precision differences.

Numba is currently not a hard dependency of Dataiter, so you’ll need to install it separately:

pip install -U numba

When, for a particular version of Dataiter, you first use a Numba-accelerated aggregation function, the code will be compiled, which might take a couple seconds. The compiled code is saved in cache. After that, using the function from cache will be really fast. In case you’re benchmarking something, note also that on the first use of such a function in a Python session, the compiled code is loaded from cache on disk, which takes something like 10–100 ms and further calls will be faster as there’s no more need to load anything.

Note

If you have trouble with Numba, please check the value of di.USE_NUMBA to see if Numba has been found. You can also set di.USE_NUMBA = False if you have Numba installed, but it’s not working right, or via the environment variable DATAITER_USE_NUMBA=false. Sometimes it’s the just the caching part of Numba that’s causing issues. When upgrading you might sometimes need to delete old caches. If that doesn’t help, you can also turn caching off with di.USE_NUMBA_CACHE = False or the environment variable DATAITER_USE_NUMBA_CACHE=false.