Aggregation
===========

.. note:: The following currently applies only to the
          :class:`.DataFrame` class. Aggregation with a
          :class:`.ListOfDicts` is simpler and is covered by the API
          documentation of :meth:`.ListOfDicts.aggregate`.

By aggregation, we mean splitting a data frame into groups based on the
values of one or more columns and then calculating group-wise summaries,
such as the total count or the mean of a column. The first step is
called ``group_by`` and the second ``aggregate``, usually written via
method chaining as ``data.group_by(...).aggregate(...)``.

Below is a simple example that calculates the total count and mean price
of Airbnb listings in New York, grouped by neighbourhood. The
``aggregate`` method takes keyword arguments: each keyword is the name
of a column in the output and each value is the function used to
calculate that summary. The return value is a regular data frame. See
the following sections for what kinds of aggregation functions you can
use.

>>> import dataiter as di
>>> data = di.read_csv("data/listings.csv")
>>> data.group_by("hood").aggregate(n=di.count(), price=di.mean("price"))
.
  hood n price

The same summary can also be computed with lambda functions, each of
which receives the rows of one group as a data frame.

>>> import dataiter as di
>>> data = di.read_csv("data/listings.csv")
>>> data.group_by("hood").aggregate(n=lambda x: x.nrow, price=lambda x: x.price.mean())
.
  hood n price

Aggregation functions have two implementations: pure Python code (slow)
and Numba code (fast). If you have Numba installed and importing it
succeeds, then Dataiter will **automatically** use it for aggregation
involving **boolean**, **integer**, **float**, **date**, and
**datetime** columns. If Numba is not available, Dataiter will
automatically fall back on the slower pure Python implementations. The
result should be the same whether Numba is used or not, apart from minor
rounding or floating-point precision differences.

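
The automatic fallback described above can be illustrated with a minimal
sketch. This is not Dataiter's actual code: the names ``mean_python``,
``mean`` and the module-level ``USE_NUMBA`` flag are hypothetical and only
show the general pattern of compiling a function with Numba when the
import succeeds and using the plain Python version otherwise.

.. code-block:: python

    def mean_python(values):
        """Plain Python mean of a non-empty sequence (the slow path)."""
        total = 0.0
        for value in values:
            total += value
        return total / len(values)

    try:
        # Importing can fail even when Numba is installed, so catch broadly.
        import numba
        # njit compiles lazily, on the first call with concrete argument types.
        mean = numba.njit(mean_python)
        USE_NUMBA = True  # hypothetical flag, named after di.USE_NUMBA
    except Exception:
        # Fall back on the pure Python implementation.
        mean = mean_python
        USE_NUMBA = False

    values = (120.0, 80.0, 100.0)
    print(USE_NUMBA, mean(values))

Either way, the caller uses ``mean`` without needing to know which
implementation is behind it.
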

Numba is currently not a hard dependency of Dataiter, so you'll need to
install it separately::

    pip install -U numba

When, for a particular version of Dataiter, you first use a
Numba-accelerated aggregation function, the code will be compiled, which
might take a couple of seconds. The compiled code is saved in a cache.
After that, the function is used from the cache, which is much faster.
If you're benchmarking something, note also that on the first use of
such a function in a Python session, the compiled code is loaded from
the cache on disk, which takes roughly 10–100 ms. Further calls will be
faster, as nothing more needs to be loaded.

.. note:: If you have trouble with Numba, please check the value of
          ``di.USE_NUMBA`` to see if Numba has been found. If Numba is
          installed but not working right, you can disable it by
          setting ``di.USE_NUMBA = False`` or the environment variable
          ``DATAITER_USE_NUMBA=false``. Sometimes it's just the caching
          part of Numba that's causing issues. When upgrading, you
          might sometimes need to delete old caches. If that doesn't
          help, you can also turn caching off with
          ``di.USE_NUMBA_CACHE = False`` or the environment variable
          ``DATAITER_USE_NUMBA_CACHE=false``.
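
As a rough sketch, the troubleshooting switches mentioned above could be
used from Python as shown below. It assumes that the environment
variables are read when ``dataiter`` is imported, which is why they are
set before the import. That ordering is an assumption, not something
stated above. The ``try``/``except`` only keeps the snippet runnable
where Dataiter is not installed.

.. code-block:: python

    import os

    # Option 1: turn off only Numba's caching via the environment.
    # Assumption: this must happen before dataiter is imported.
    os.environ["DATAITER_USE_NUMBA_CACHE"] = "false"

    try:
        import dataiter as di
    except ImportError:
        di = None  # Dataiter not installed, nothing to configure

    if di is not None:
        # Check whether Numba was found and is enabled.
        print("USE_NUMBA:", di.USE_NUMBA)
        # Option 2: turn Numba off entirely for this session.
        di.USE_NUMBA = False
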