dataiter.Vector

class dataiter.Vector(object, dtype=None)

A one-dimensional array.

Vector is a subclass of NumPy ndarray. Note that not all ndarray methods have been overridden, so careless use of base class in-place methods can twist the data into a multi-dimensional or other non-vector form, causing unexpected results.

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
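
To illustrate the caveat above, here is a minimal sketch using a hypothetical ndarray subclass, MiniVector, which is not part of dataiter. It assumes only NumPy and uses reshape(), which is not an in-place method but shows the same lack of guarding: an inherited method returns a two-dimensional instance of the subclass because nothing in the base class enforces one dimension.

```python
import numpy as np

class MiniVector(np.ndarray):
    """Hypothetical one-dimensional ndarray subclass (not dataiter's Vector)."""

    def __new__(cls, data):
        # View an ordinary array as the subclass without copying.
        return np.asarray(data).view(cls)

v = MiniVector([1, 2, 3, 4])
m = v.reshape(2, 2)  # inherited from ndarray, unaware of the 1-D contract

print(type(m).__name__, v.ndim, m.ndim)
```

The result is still an instance of the subclass, yet it is no longer a vector in any useful sense.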

__init__(object, dtype=None)

Return a new vector.

object can be any one-dimensional sequence, such as a NumPy array, Python list or tuple. Creating a vector from a NumPy array is fast; creating one from other types is slower, because data types and special values need to be converted.

dtype is the NumPy-compatible data type for the vector. Providing dtype makes creating the vector faster; otherwise the appropriate data type is guessed by inspecting the elements of object, which can be slow, especially for large objects.

>>> di.Vector([1, 2, 3], int)
[ 1 2 3 ] int64

as_boolean()

Return vector converted to boolean data type.

>>> vector = di.Vector([0, 1])
>>> vector.as_boolean()
[ False True ] bool

as_bytes()

Return vector converted to bytes data type.

>>> vector = di.Vector(["a", "b"])
>>> vector.as_bytes()
[ b'a' b'b' ] |S1

as_date()

Return vector converted to date data type.

>>> vector = di.Vector(["2020-01-01"])
>>> vector.as_date()
[ 2020-01-01 ] datetime64[D]

as_datetime(precision='us')

Return vector converted to datetime data type.

>>> vector = di.Vector(["2020-01-01T12:00:00"])
>>> vector.as_datetime()
[ 2020-01-01T12:00:00.000000 ] datetime64[us]

as_float()

Return vector converted to float data type.

>>> vector = di.Vector([1, 2, 3])
>>> vector.as_float()
[ 1 2 3 ] float64

as_integer()

Return vector converted to integer data type.

>>> vector = di.Vector([1.0, 2.0, 3.0])
>>> vector.as_integer()
[ 1 2 3 ] int64

as_object()

Return vector converted to object data type.

>>> vector = di.Vector([1, 2, 3])
>>> vector.as_object()
[ 1 2 3 ] object

as_string(length=None)

Return vector converted to string data type.

>>> vector = di.Vector([1, 2, 3])
>>> vector.as_string()
[ "1" "2" "3" ] <U21
>>> vector.as_string(64)
[ "1" "2" "3" ] <U64
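
The <U21 and <U64 in the output above are NumPy fixed-width Unicode data types. As a rough sketch, assuming as_string() behaves like a plain NumPy cast (an assumption, not something stated here), the same widths appear when casting an int64 array directly:

```python
import numpy as np

numbers = np.array([1, 2, 3], dtype=np.int64)

# Casting int64 to str reserves room for the longest possible int64.
default_width = numbers.astype(str)
# An explicit width can be requested with a "U<length>" data type.
explicit_width = numbers.astype("U64")

print(default_width.dtype, explicit_width.dtype)
```
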

concat(*others)

Return vector with elements from others appended.

>>> a = di.Vector([1, 2, 3])
>>> b = di.Vector([4, 5, 6])
>>> c = di.Vector([7, 8, 9])
>>> a.concat(b, c)
[ 1 2 3 4 5 6 7 8 9 ] int64

drop_na()

Return vector without missing values.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector.drop_na()
[ 1 2 3 ] float64

equal(other)

Return whether vectors are equal.

Equality is tested with ==. As an exception, corresponding missing values are considered equal as well.

>>> a = di.Vector([1, 2, 3, None])
>>> b = di.Vector([1, 2, 3, None])
>>> a
[ 1 2 3 nan ] float64
>>> b
[ 1 2 3 nan ] float64
>>> a.equal(b)
True
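
The exception for missing values matters because NaN never compares equal to itself. The following NumPy-only sketch shows the idea for float data; equal_with_na is a hypothetical helper written for this illustration, not dataiter's implementation.

```python
import numpy as np

def equal_with_na(a, b):
    """Hypothetical helper: element-wise ==, except NaN matches NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    both_na = np.isnan(a) & np.isnan(b)
    return bool(np.all((a == b) | both_na))

a = np.array([1, 2, 3, np.nan])
b = np.array([1, 2, 3, np.nan])

print(bool(np.all(a == b)))  # plain == sees NaN != NaN and reports a mismatch
print(equal_with_na(a, b))   # missing values in the same position match
```
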

classmethod fast(object, dtype=None)

Return a new vector.

Unlike __init__(), this will not convert special values in object. Use this only if you know object doesn’t contain special values or if you know they are already of the correct type.
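
As an illustration of what a special value can be, the following NumPy-only sketch shows that a Python list containing None becomes an object array unless the None is first converted to np.nan. The helper to_float_array is hypothetical and is not dataiter's conversion code; it only mirrors the float64 result shown for di.Vector([1, 2, 3, None]) elsewhere on this page.

```python
import numpy as np

raw = [1, 2, 3, None]

# Without conversion, NumPy falls back to the generic object data type.
unconverted = np.array(raw)

def to_float_array(values):
    """Hypothetical helper: replace None with NaN before building the array."""
    return np.array([np.nan if x is None else x for x in values], dtype=float)

converted = to_float_array(raw)
print(unconverted.dtype, converted.dtype)
```

Skipping such a conversion is only safe when the input is known to be free of values like None.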

get_memory_use()

Return memory use in bytes.

>>> vector = di.Vector(range(100))
>>> vector.get_memory_use()
800
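
The 800 above is consistent with 100 elements of 8 bytes each. For a fixed-width numeric array, plain NumPy reports the same figure through nbytes. Whether get_memory_use() counts anything beyond that for other data types is not stated here, so treat this as a sketch:

```python
import numpy as np

array = np.arange(100, dtype=np.int64)

# itemsize is the size of one element in bytes, nbytes the whole buffer.
print(array.itemsize, array.size, array.nbytes)
```
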

head(n=None)

Return the first n elements.

>>> vector = di.Vector(range(100))
>>> vector.head(10)
[ 0 1 2 3 4 5 6 7 8 9 ] int64

is_boolean()

Return whether vector data type is boolean.

is_bytes()

Return whether vector data type is bytes.

is_datetime()

Return whether vector data type is datetime.

Dates are considered datetimes as well.
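
In NumPy terms, dates and datetimes are the same kind of data type at different resolutions. The sketch below uses np.issubdtype to show this; it is an assumption that is_datetime() performs a comparable check.

```python
import numpy as np

dates = np.array(["2020-01-01"], dtype="datetime64[D]")
datetimes = np.array(["2020-01-01T12:00:00"], dtype="datetime64[us]")

# Both resolutions are sub-types of the generic np.datetime64.
for array in (dates, datetimes):
    print(array.dtype, np.issubdtype(array.dtype, np.datetime64))
```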

is_float()

Return whether vector data type is float.

is_integer()

Return whether vector data type is integer.

is_na()

Return a boolean vector indicating missing data elements.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector
[ 1 2 3 nan ] float64
>>> vector.is_na()
[ False False False True ] bool

is_number()

Return whether vector data type is number.

is_object()

Return whether vector data type is object.

is_string()

Return whether vector data type is string.

is_timedelta()

Return whether vector data type is timedelta.

property length

Return the number of elements.

>>> vector = di.Vector(range(100))
>>> vector.length
100

map(function, *args, dtype=None, **kwargs)

Apply function element-wise and return a new vector.

>>> import math
>>> vector = di.Vector(range(10))
>>> vector.map(math.pow, 2)
[ 0 1 4 9 16 25 36 49 64 81 ] float64

property na_dtype

Return the corresponding data type that can handle missing data.

You might need this for upcasting when missing data is first introduced.

>>> vector = di.Vector([1, 2, 3])
>>> vector
[ 1 2 3 ] int64
>>> vector.put([2], vector.na_value)
Traceback (most recent call last):
  File "<string>", line 14, in <module>
ValueError: cannot convert float NaN to integer
>>> vector = vector.astype(vector.na_dtype)
>>> vector
[ 1 2 3 ] float64
>>> vector.put([2], vector.na_value)
>>> vector
[ 1 2 nan ] float64

property na_value

Return the corresponding value to use to represent missing data.

Dataiter is built on top of NumPy. NumPy doesn’t support a proper missing value (“NA”), only data type specific values: np.nan, np.datetime64("NaT") and np.timedelta64("NaT"). Dataiter recommends the following values be used and internally supports them to an extent.

datetime     np.datetime64("NaT")
float        np.nan
integer      np.nan
string       ""
timedelta    np.timedelta64("NaT")
other        None

Note that actually using these might require upcasting the vector. Integer vectors need to be upcast to float to contain np.nan. Other types, such as boolean, need to be upcast to object to contain None.

If you need to avoid object columns, you can also consider converting booleans to float using as_float(), which will give you 0.0 for false and 1.0 for true. Depending on how you use the data, that might work as well as an object vector of True, False and None.
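
The two options for booleans described above can be compared with a short NumPy-only sketch. It uses plain arrays rather than dataiter vectors, so it shows only the underlying upcasting, not dataiter's own handling.

```python
import numpy as np

flags = np.array([True, False, True])

# Option 1: upcast to object so that None can be stored.
as_object = flags.astype(object)
as_object[1] = None

# Option 2: upcast to float so that np.nan can be stored.
as_float = flags.astype(float)
as_float[1] = np.nan

print(as_object.tolist())
print(as_float.tolist())
```

The float version keeps a fixed-width numeric data type, at the cost of representing true and false as 1.0 and 0.0.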

range()

Return the minimum and maximum values as a two-element vector.

>>> vector = di.Vector(range(100))
>>> vector.range()
[ 0 99 ] int64

rank(*, method='average')

Return the order of elements in a sorted vector.

method determines how ties are resolved. ‘min’ assigns each of equal values the same rank, the minimum of the set (also called “competition ranking”). ‘max’ is the same, but assigning the maximum of the set. ‘average’ is the mean of ‘min’ and ‘max’. ‘ordinal’ gives each element a distinct rank with equal values ranked by their order in input.

Ranks begin at 1. Missing values are ranked last.

>>> vector = di.Vector([3, 1, 1, 1, 2, 2])
>>> vector.rank(method="min")
[ 6 1 1 1 4 4 ] int64
>>> vector.rank(method="max")
[ 6 3 3 3 5 5 ] int64
>>> vector.rank(method="average")
[ 6.0 2.0 2.0 2.0 4.5 4.5 ] float64
>>> vector.rank(method="ordinal")
[ 6 1 2 3 4 5 ] int64
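
To make the four tie-breaking rules concrete, here is a NumPy-only sketch that reproduces the ranks shown above. The function rank below is a hypothetical stand-alone helper written for this illustration, not dataiter's implementation, and it does not handle missing values.

```python
import numpy as np

def rank(values, method="average"):
    """Hypothetical NumPy-only ranking; missing values are not handled."""
    values = np.asarray(values)
    # Ordinal ranks: position of each element in a stable sort, from 1.
    order = np.argsort(values, kind="stable")
    ordinal = np.empty(len(values), dtype=int)
    ordinal[order] = np.arange(1, len(values) + 1)
    if method == "ordinal":
        return ordinal
    # For each distinct value, the lowest and highest ordinal rank it spans.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    highest = np.cumsum(counts)
    lowest = highest - counts + 1
    if method == "min":
        return lowest[inverse]
    if method == "max":
        return highest[inverse]
    if method == "average":
        return (lowest[inverse] + highest[inverse]) / 2
    raise ValueError(f"unknown method: {method!r}")

values = [3, 1, 1, 1, 2, 2]
for method in ("min", "max", "average", "ordinal"):
    print(method, rank(values, method).tolist())
```

The three ones occupy ordinal ranks 1 to 3, so they get 1 under 'min', 3 under 'max' and 2.0 under 'average'.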

replace_na(value)

Return vector with missing values replaced with value.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector.replace_na(0)
[ 1 2 3 0 ] float64

sample(n=None)

Return n randomly chosen elements.

>>> vector = di.Vector(range(100))
>>> vector.sample(10)
[ 4 14 17 24 37 61 84 86 93 94 ] int64

sort(*, dir=1)

Return elements in sorted order.

dir is 1 for ascending sort, -1 for descending.

Missing values are sorted last, regardless of dir.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector.sort(dir=1)
[ 1 2 3 nan ] float64
>>> vector.sort(dir=-1)
[ 3 2 1 nan ] float64
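
Keeping missing values last in both directions takes a little care, because simply reversing an ascending NumPy sort would move NaN to the front. The sketch below shows one way to get the behavior described above for float data; sort_na_last is a hypothetical helper, not dataiter's implementation.

```python
import numpy as np

def sort_na_last(values, dir=1):
    """Hypothetical helper: sort floats, always placing NaN at the end."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    # Sort only the present values, then reverse them if descending.
    present = np.sort(values[~missing])
    if dir < 0:
        present = present[::-1]
    return np.concatenate([present, values[missing]])

data = [2, np.nan, 3, 1]
print(sort_na_last(data, dir=1).tolist())
print(sort_na_last(data, dir=-1).tolist())
```
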

tail(n=None)

Return the last n elements.

>>> vector = di.Vector(range(100))
>>> vector.tail(10)
[ 90 91 92 93 94 95 96 97 98 99 ] int64

to_string(*, max_elements=None)

Return vector as a string formatted for display.

>>> vector = di.Vector([1/2, 1/3, 1/4])
>>> vector.to_string()
[ 0.500000 0.333333 0.250000 ] float64

to_strings(*, ksep=None, quote=True, pad=False, truncate_width=inf)

Return vector as strings formatted for display.

>>> vector = di.Vector([1/2, 1/3, 1/4])
>>> vector.to_strings()
[ "0.500000" "0.333333" "0.250000" ] <U8

tolist()

Return vector as a list with elements of matching Python builtin type.

Missing values are replaced with None.
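
As a rough illustration of the result, the NumPy-only sketch below converts a float array to Python builtins and maps NaN to None. The helper to_list_with_none is hypothetical and covers only float data; it is not dataiter's implementation.

```python
import numpy as np

def to_list_with_none(array):
    """Hypothetical helper: builtin Python values, with NaN mapped to None."""
    # ndarray.tolist() already converts NumPy scalars to Python builtins.
    return [None if isinstance(x, float) and x != x else x for x in array.tolist()]

values = to_list_with_none(np.array([1.0, 2.0, np.nan]))
print(values)
```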

unique()

Return unique elements.

>>> vector = di.Vector([1, 1, 1, 2, 2, 3])
>>> vector.unique()
[ 1 2 3 ] int64