dataiter

The following functions are shorthand helpers for use in conjunction with DataFrame.aggregate(); see the guide on aggregation for details.

all() any() count() count_unique() first() last() max() mean() median() min() mode() nth() quantile() std() sum() var()

The following read functions are convenience aliases to the corresponding methods of the classes generally most suitable for the particular file type, i.e. DataFrame for CSV, NPZ and Parquet, GeoJSON for GeoJSON and ListOfDicts for JSON.

read_csv() read_geojson() read_json() read_npz() read_parquet()

dataiter.all(x)[source]

Return whether all elements of x evaluate to True.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.all, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.all.html

>>> di.all(di.Vector([True, False]))
False
>>> di.all(di.Vector([True, True]))
True
>>> di.all("x")
<function all.<locals>.aggregate at 0x7f5bbb4d44a0>
dataiter.any(x)[source]

Return whether any element of x evaluates to True.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.any, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.any.html

>>> di.any(di.Vector([False, False]))
False
>>> di.any(di.Vector([True, False]))
True
>>> di.any("x")
<function any.<locals>.aggregate at 0x7fe67d4644a0>
dataiter.count(x='', *, drop_na=False)[source]

Return the number of elements in x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x. Since all columns in a data frame have the same number of elements (i.e. rows), you can leave the x argument at its default blank string, which will give you that row count.

>>> di.count(di.Vector([1, 2, 3]))
3
>>> di.count()
<function count.<locals>.aggregate at 0x7f6ba6a184a0>
dataiter.count_unique(x, *, drop_na=False)[source]

Return the number of unique elements in x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.count_unique(di.Vector([1, 2, 2, 3, 3, 3]))
3
>>> di.count_unique("x")
<function count_unique.<locals>.aggregate at 0x7f6f33f104a0>
dataiter.first(x, *, drop_na=False)[source]

Return the first element of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.first(di.Vector([1, 2, 3]))
1
>>> di.first("x")
<function nth.<locals>.aggregate at 0x7ff1050dc4a0>
dataiter.last(x, *, drop_na=False)[source]

Return the last element of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.last(di.Vector([1, 2, 3]))
3
>>> di.last("x")
<function nth.<locals>.aggregate at 0x7fb0418704a0>
dataiter.max(x, *, drop_na=True)[source]

Return the maximum of elements in x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.max(di.Vector([4, 5, 6]))
6
>>> di.max("x")
<function max.<locals>.aggregate at 0x7fd1dbe884a0>
dataiter.mean(x, *, drop_na=True)[source]

Return the arithmetic mean of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.mean, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.mean.html

>>> di.mean(di.Vector([1, 2, 10]))
4.333333333333333
>>> di.mean("x")
<function mean.<locals>.aggregate at 0x7f72dfd684a0>
dataiter.median(x, *, drop_na=True)[source]

Return the median of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.median, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.median.html

>>> di.median(di.Vector([5, 1, 2]))
2.0
>>> di.median("x")
<function median.<locals>.aggregate at 0x7fd8c67944a0>
dataiter.min(x, *, drop_na=True)[source]

Return the minimum of elements in x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.min(di.Vector([4, 5, 6]))
4
>>> di.min("x")
<function min.<locals>.aggregate at 0x7fcab52844a0>
dataiter.mode(x, *, drop_na=True)[source]

Return the most common value in x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.mode(di.Vector([1, 2, 2, 3, 3, 3]))
3
>>> di.mode("x")
<function mode.<locals>.aggregate at 0x7faac55c44a0>
dataiter.nth(x, index, *, drop_na=False)[source]

Return the element of x at index (zero-based).

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.nth(di.Vector([1, 2, 3]), 1)
2
>>> di.nth("x", 1)
<function nth.<locals>.aggregate at 0x7fb3b3dbc4a0>
dataiter.quantile(x, q, *, drop_na=True)[source]

Return the qth quantile of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.quantile, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.quantile.html

>>> di.quantile(di.Vector([1, 5, 6]), 0.5)
5.0
>>> di.quantile("x", 0.5)
<function quantile.<locals>.aggregate at 0x7f3ee11b04a0>
dataiter.read_csv(path, *, encoding='utf-8', sep=',', header=True, columns=[], strings_as_object=inf, dtypes={})[source]

Return a new data frame from CSV file path.

Will automatically decompress if path ends in .bz2|.gz|.xz.

columns is an optional list of columns to limit to.

strings_as_object is a cutoff, in characters: if any value in a string column is longer than that, the whole column will use the object data type. This is intended to limit memory use, as NumPy strings are fixed-width and a column can take a huge amount of memory if even a single value is long. If set, dtypes overrides this.

dtypes is an optional dict mapping column names to NumPy datatypes.

Note

read_csv() is a convenience alias for DataFrame.read_csv().

dataiter.read_geojson(path, *, encoding='utf-8', columns=[], strings_as_object=inf, dtypes={}, **kwargs)[source]

Return data from GeoJSON file path.

Will automatically decompress if path ends in .bz2|.gz|.xz.

columns is an optional list of columns to limit to.

strings_as_object is a cutoff, in characters: if any value in a string column is longer than that, the whole column will use the object data type. This is intended to limit memory use, as NumPy strings are fixed-width and a column can take a huge amount of memory if even a single value is long. If set, dtypes overrides this.

dtypes is an optional dict mapping column names to NumPy datatypes.

kwargs are passed to json.load.

Note

read_geojson() is a convenience alias for GeoJSON.read().

dataiter.read_json(path, *, encoding='utf-8', keys=[], types={}, **kwargs)[source]

Return a new list from JSON file path.

Will automatically decompress if path ends in .bz2|.gz|.xz.

keys is an optional list of keys to limit to. types is an optional dict mapping keys to datatypes. kwargs are passed to json.load.

Note

read_json() is a convenience alias for ListOfDicts.read_json().

dataiter.read_npz(path, *, allow_pickle=True)[source]

Return a new data frame from NumPy file path.

See numpy.load for an explanation of allow_pickle: https://numpy.org/doc/stable/reference/generated/numpy.load.html

Note

read_npz() is a convenience alias for DataFrame.read_npz().

dataiter.read_parquet(path, *, columns=[], strings_as_object=inf, dtypes={})[source]

Return a new data frame from Parquet file path.

columns is an optional list of columns to limit to.

strings_as_object is a cutoff, in characters: if any value in a string column is longer than that, the whole column will use the object data type. This is intended to limit memory use, as NumPy strings are fixed-width and a column can take a huge amount of memory if even a single value is long. If set, dtypes overrides this.

dtypes is an optional dict mapping column names to NumPy datatypes.

Note

read_parquet() is a convenience alias for DataFrame.read_parquet().

dataiter.std(x, *, ddof=0, drop_na=True)[source]

Return the standard deviation of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.std, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.std.html

>>> di.std(di.Vector([3, 6, 7]))
1.699673171197595
>>> di.std("x")
<function std.<locals>.aggregate at 0x7fab0fe544a0>
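ddof is the "delta degrees of freedom" passed through to NumPy: the divisor in the calculation is N - ddof, so the default ddof=0 gives the population standard deviation and ddof=1 the sample standard deviation. The same parameter applies to var(). Illustrated with numpy.std directly:

```python
import numpy as np

x = np.array([3, 6, 7])

# Population standard deviation (divide by N), the default.
pop = np.std(x, ddof=0)

# Sample standard deviation (divide by N - 1).
sample = np.std(x, ddof=1)
```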
dataiter.sum(x, *, drop_na=True)[source]

Return the sum of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

>>> di.sum(di.Vector([1, 2, 3]))
6
>>> di.sum("x")
<function sum.<locals>.aggregate at 0x7fd7962d04a0>
dataiter.var(x, *, ddof=0, drop_na=True)[source]

Return the variance of x.

If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.

Uses numpy.var, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.var.html

>>> di.var(di.Vector([3, 6, 7]))
2.888888888888889
>>> di.var("x")
<function var.<locals>.aggregate at 0x7fab02e604a0>