dataiter
The following functions are shorthand helpers for use in conjunction
with DataFrame.aggregate(); see the guide on aggregation for details.
all()
any()
count()
count_unique()
first()
last()
max()
mean()
median()
min()
mode()
nth()
quantile()
std()
sum()
var()
The following read functions are convenience aliases to the corresponding
methods of the classes generally most suitable for the particular file
type, i.e. DataFrame for CSV, NPZ and Parquet,
GeoJSON for GeoJSON and ListOfDicts for JSON.
read_csv()
read_geojson()
read_json()
read_npz()
read_parquet()
The following constants can be used to customize certain defaults, such as formatting and limits for printing.
dataiter.PRINT_MAX_WIDTH
dataiter.PRINT_THOUSAND_SEPARATOR
dataiter.PRINT_TRUNCATE_WIDTH
dataiter.USE_NUMBA
dataiter.USE_NUMBA_CACHE
- dataiter.PRINT_MAX_WIDTH = 80
Maximum number of columns to wrap print output to. This is only a fallback in case Python's
shutil.get_terminal_size fails to detect the width of your terminal. By default the detected full width is used.
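The fallback behaviour described above can be sketched with the standard library alone. This is not dataiter's actual code; the module-level name and the helper get_print_width are hypothetical stand-ins used only for illustration.

```python
import shutil

# Hypothetical stand-in for dataiter.PRINT_MAX_WIDTH.
PRINT_MAX_WIDTH = 80

def get_print_width(fallback=PRINT_MAX_WIDTH):
    """Return the detected terminal width, or fallback if detection fails."""
    # shutil.get_terminal_size only uses the fallback tuple when neither the
    # COLUMNS environment variable nor a terminal query gives a usable size.
    return shutil.get_terminal_size(fallback=(fallback, 24)).columns

width = get_print_width()
print(width)
```

When output is piped to a file there is usually no terminal to query, which is the case where the fallback value takes effect.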
- dataiter.PRINT_THOUSAND_SEPARATOR = ''
Thousand separator to use when printing numbers. By default this is blank, meaning no thousand separators are rendered.
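To illustrate what a thousand separator does to printed numbers, here is a minimal sketch in plain Python. The function format_int is hypothetical and is not part of dataiter; how dataiter applies the separator internally is not shown in this reference.

```python
def format_int(value, separator=""):
    """Format an integer with the given thousand separator."""
    # Use Python's built-in "," grouping, then swap in the chosen separator.
    # A blank separator removes the grouping entirely.
    return f"{value:,}".replace(",", separator)

print(format_int(1234567))       # blank separator, no grouping
print(format_int(1234567, " "))  # space as separator
```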
- dataiter.PRINT_TRUNCATE_WIDTH = 36
Maximum width to truncate string columns to in
DataFrame print output. When this is exceeded, strings will be cut and an ellipsis (…) rendered at the cut point.
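A rough sketch of this kind of truncation follows. The helper truncate is hypothetical, and it assumes the ellipsis counts toward the maximum width; dataiter's exact cut point may differ.

```python
def truncate(text, max_width=36):
    """Cut text to at most max_width characters, marking the cut with an ellipsis."""
    if len(text) <= max_width:
        return text
    # Assumption: reserve one character of the width for the ellipsis.
    return text[:max_width - 1] + "…"

print(truncate("short"))
print(truncate("x" * 50))
```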
- dataiter.USE_NUMBA = False
True to use Numba, if available, to speed up aggregations, False to only use pure Python code.
- dataiter.USE_NUMBA_CACHE = True
True to use Numba cache for JIT-compiled aggregations, False to only keep compiled code in memory for the duration of the session.
- dataiter.all(x)
Return whether all elements of x evaluate to True.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.all, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.all.html
>>> di.all(di.Vector([True, False]))
False
>>> di.all(di.Vector([True, True]))
True
>>> di.all("x")
<function all.<locals>.aggregate at 0x7f18bbd74c20>
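The two call styles shared by these helpers (data in, value out; column name in, function out) can be sketched as a closure. This is an illustration of the pattern only, not dataiter's implementation: all_ is a hypothetical name, and each group is assumed to be a plain dict mapping column names to lists.

```python
def all_(x):
    """Hypothetical sketch of the string-or-data dispatch."""
    if isinstance(x, str):
        column = x
        def aggregate(group):
            # Assumption: group maps column names to lists of values.
            return all(group[column])
        # Deferred form: the caller applies this function once per group.
        return aggregate
    # Immediate form: x is the data itself.
    return all(x)

groups = {
    "a": {"x": [True, False]},
    "b": {"x": [True, True]},
}
function = all_("x")
result = {key: function(group) for key, group in groups.items()}
print(result)
```

Deferring the computation this way lets an aggregation call name the column once and leave the per-group iteration to the data frame.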
- dataiter.any(x)
Return whether any element of x evaluates to True.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.any, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.any.html
>>> di.any(di.Vector([False, False]))
False
>>> di.any(di.Vector([True, False]))
True
>>> di.any("x")
<function any.<locals>.aggregate at 0x7f8074d94c20>
- dataiter.count(x='', *, drop_na=False)
Return the number of elements in x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x. Since all columns in a data frame should have the same number of elements (i.e. rows), you can leave the x argument at its default blank string to get the row count.
>>> di.count(di.Vector([1, 2, 3]))
3
>>> di.count()
<function count.<locals>.aggregate at 0x7fd934684c20>
- dataiter.count_unique(x, *, drop_na=False)
Return the number of unique elements in x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.count_unique(di.Vector([1, 2, 2, 3, 3, 3]))
3
>>> di.count_unique("x")
<function count_unique.<locals>.aggregate at 0x7f22dbca4c20>
- dataiter.first(x, *, drop_na=False)
Return the first element of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.first(di.Vector([1, 2, 3]))
1
>>> di.first("x")
<function nth.<locals>.aggregate at 0x7fcef9f4cc20>
- dataiter.last(x, *, drop_na=False)
Return the last element of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.last(di.Vector([1, 2, 3]))
3
>>> di.last("x")
<function nth.<locals>.aggregate at 0x7f769ace4c20>
- dataiter.max(x, *, drop_na=True)
Return the maximum of elements in x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.max(di.Vector([4, 5, 6]))
6
>>> di.max("x")
<function max.<locals>.aggregate at 0x7f3b6e920c20>
- dataiter.mean(x, *, drop_na=True)
Return the arithmetic mean of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.mean, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.mean.html
>>> di.mean(di.Vector([1, 2, 10]))
4.333333333333333
>>> di.mean("x")
<function mean.<locals>.aggregate at 0x7f390eec8c20>
- dataiter.median(x, *, drop_na=True)
Return the median of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.median, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.median.html
>>> di.median(di.Vector([5, 1, 2]))
2.0
>>> di.median("x")
<function median.<locals>.aggregate at 0x7f14f1184c20>
- dataiter.min(x, *, drop_na=True)
Return the minimum of elements in x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.min(di.Vector([4, 5, 6]))
4
>>> di.min("x")
<function min.<locals>.aggregate at 0x7f90421c8c20>
- dataiter.mode(x, *, drop_na=True)
Return the most common value in x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.mode(di.Vector([1, 2, 2, 3, 3, 3]))
3
>>> di.mode("x")
<function mode.<locals>.aggregate at 0x7fa25b30cc20>
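As a plain-Python illustration of "most common value", the sketch below uses collections.Counter. It is not dataiter's implementation, and its tie-breaking (the value encountered first wins) is a property of Counter that may not match what dataiter does.

```python
from collections import Counter

def mode(values):
    """Return the most common value; ties go to the value seen first."""
    # most_common(1) returns a list holding one (value, count) pair.
    return Counter(values).most_common(1)[0][0]

print(mode([1, 2, 2, 3, 3, 3]))
```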
- dataiter.nth(x, index, *, drop_na=False)
Return the element of x at index (zero-based).
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.nth(di.Vector([1, 2, 3]), 1)
2
>>> di.nth("x", 1)
<function nth.<locals>.aggregate at 0x7f440d7e4c20>
- dataiter.quantile(x, q, *, drop_na=True)
Return the qth quantile of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.quantile, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.quantile.html
>>> di.quantile(di.Vector([1, 5, 6]), 0.5)
5.0
>>> di.quantile("x", 0.5)
<function quantile.<locals>.aggregate at 0x7f2eb5164c20>
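For readers unfamiliar with how a quantile falls between data points, here is a pure-Python sketch of linear interpolation, which is the default method of numpy.quantile. It assumes q is between 0 and 1 and the input is non-empty; whether dataiter passes any other options to NumPy is not stated in this reference.

```python
import math

def quantile(values, q):
    """Return the qth quantile (0 <= q <= 1) using linear interpolation."""
    data = sorted(values)
    # Fractional position of the quantile within the sorted data.
    position = q * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two neighbouring values.
    return data[lower] + (data[upper] - data[lower]) * fraction

print(quantile([1, 5, 6], 0.5))
print(quantile([1, 5, 6], 0.25))
```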
- dataiter.read_csv(path, *, encoding='utf-8', sep=',', header=True, columns=[], dtypes={})
Return a new data frame from CSV file path.
Will automatically decompress if path ends in .bz2|.gz|.xz. columns is an optional list of columns to limit to. dtypes is an optional dict mapping column names to NumPy datatypes.
Note
read_csv() is a convenience alias for DataFrame.read_csv().
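Decompression keyed on the file extension, as the read functions describe, can be sketched with the standard library. The helper open_text and the OPENERS table are hypothetical and only illustrate the idea; dataiter's own file handling may be organized differently.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical mapping from file extension to a matching open function.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open, ".xz": lzma.open}

def open_text(path, encoding="utf-8"):
    """Open path for reading text, decompressing based on its extension."""
    extension = os.path.splitext(path)[1]
    # Fall back to the built-in open for uncompressed files.
    opener = OPENERS.get(extension, open)
    return opener(path, "rt", encoding=encoding)

# Round trip: write a small gzip-compressed CSV, then read it back.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "data.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("a,b\n1,2\n")
    with open_text(path) as f:
        text = f.read()

print(text)
```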
- dataiter.read_geojson(path, *, encoding='utf-8', columns=[], dtypes={}, **kwargs)
Return data from GeoJSON file path.
Will automatically decompress if path ends in .bz2|.gz|.xz. columns is an optional list of columns to limit to. dtypes is an optional dict mapping column names to NumPy datatypes. kwargs are passed to json.load.
Note
read_geojson() is a convenience alias for GeoJSON.read().
- dataiter.read_json(path, *, encoding='utf-8', keys=[], types={}, **kwargs)
Return a new list from JSON file path.
Will automatically decompress if path ends in .bz2|.gz|.xz. keys is an optional list of keys to limit to. types is an optional dict mapping keys to datatypes. kwargs are passed to json.load.
Note
read_json() is a convenience alias for ListOfDicts.read_json().
- dataiter.read_npz(path, *, allow_pickle=True)
Return a new data frame from NumPy file path.
See numpy.load for an explanation of allow_pickle: https://numpy.org/doc/stable/reference/generated/numpy.load.html
Note
read_npz() is a convenience alias for DataFrame.read_npz().
- dataiter.read_parquet(path, *, columns=[], dtypes={})
Return a new data frame from Parquet file path.
columns is an optional list of columns to limit to. dtypes is an optional dict mapping column names to NumPy datatypes.
Note
read_parquet() is a convenience alias for DataFrame.read_parquet().
- dataiter.std(x, *, ddof=0, drop_na=True)
Return the standard deviation of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.std, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.std.html
>>> di.std(di.Vector([3, 6, 7]))
1.699673171197595
>>> di.std("x")
<function std.<locals>.aggregate at 0x7f0e48e84c20>
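The ddof argument (delta degrees of freedom, as in numpy.std) changes the divisor from N to N - ddof. The pure-Python sketch below is an illustration of that arithmetic, not dataiter's code; the same divisor applies to the variance.

```python
import math

def std(values, ddof=0):
    """Return the standard deviation using the divisor N - ddof."""
    mean = sum(values) / len(values)
    squares = sum((value - mean) ** 2 for value in values)
    # ddof=0 gives the population figure, ddof=1 the sample figure.
    return math.sqrt(squares / (len(values) - ddof))

print(std([3, 6, 7]))          # population standard deviation
print(std([3, 6, 7], ddof=1))  # sample standard deviation
```

With ddof=0 this matches the value shown in the std() example above, to within floating-point rounding; ddof=1 gives a larger value because the divisor is smaller.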
- dataiter.sum(x, *, drop_na=True)
Return the sum of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
>>> di.sum(di.Vector([1, 2, 3]))
6
>>> di.sum("x")
<function sum.<locals>.aggregate at 0x7f4b4bcf0c20>
- dataiter.var(x, *, ddof=0, drop_na=True)
Return the variance of x.
If x is a string, return a function usable with DataFrame.aggregate() that operates group-wise on column x.
Uses numpy.var, see the NumPy documentation for details: https://numpy.org/doc/stable/reference/generated/numpy.var.html
>>> di.var(di.Vector([3, 6, 7]))
2.888888888888889
>>> di.var("x")
<function var.<locals>.aggregate at 0x7fcbf860cc20>