dataiter.Vector

class dataiter.Vector(object, dtype=None)

A one-dimensional array.

Vector is a subclass of NumPy ndarray. Not all ndarray methods have been overridden, so careless use of base class in-place methods can twist the data into multi-dimensional or other non-vector form, causing unexpected results.

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
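To make the caveat above concrete, here is a minimal sketch using a hypothetical `MyVector` subclass (a stand-in for illustration, not dataiter's `Vector`). It shows how an inherited in-place method can leave a one-dimensional ndarray subclass in two-dimensional form.

```python
import numpy as np

class MyVector(np.ndarray):
    """Hypothetical stand-in for a one-dimensional ndarray subclass."""

# Allocate a 4-element instance that owns its data, then fill it.
vector = MyVector(shape=(4,), dtype=np.int64)
vector[:] = [1, 2, 3, 4]
ndim_before = vector.ndim

# ndarray.resize works in place and is inherited unchanged,
# so nothing stops it from making the "vector" two-dimensional.
vector.resize((2, 2), refcheck=False)
print(type(vector).__name__, ndim_before, vector.shape)
```

Methods that return a new array, rather than modifying the existing one, avoid this particular problem.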

__init__(object, dtype=None)

Return a new vector.

object can be any one-dimensional sequence, such as a NumPy array, Python list or tuple. Creating a vector from a NumPy array is fast; creating one from other types is slower because data types and special values need to be converted.

dtype is the NumPy-compatible data type for the vector. Providing dtype makes creating the vector faster. Otherwise the appropriate data type is guessed by inspecting the elements of object, which can be slow, especially for large objects.

>>> di.Vector([1, 2, 3], int)
[ 1 2 3 ] int64

as_boolean()

Return vector converted to boolean data type.

>>> vector = di.Vector([0, 1])
>>> vector.as_boolean()
[ False True ] bool

as_bytes()

Return vector converted to bytes data type.

>>> vector = di.Vector(["a", "b"])
>>> vector.as_bytes()
[ b'a' b'b' ] |S1

as_date()

Return vector converted to date data type.

>>> vector = di.Vector(["2020-01-01"])
>>> vector.as_date()
[ 2020-01-01 ] datetime64[D]

as_datetime(precision='us')

Return vector converted to datetime data type.

>>> vector = di.Vector(["2020-01-01T12:00:00"])
>>> vector.as_datetime()
[ 2020-01-01T12:00:00.000000 ] datetime64[us]

as_float()

Return vector converted to float data type.

>>> vector = di.Vector([1, 2, 3])
>>> vector.as_float()
[ 1 2 3 ] float64

as_integer()

Return vector converted to integer data type.

>>> vector = di.Vector([1.0, 2.0, 3.0])
>>> vector.as_integer()
[ 1 2 3 ] int64

as_object()

Return vector converted to object data type.

>>> vector = di.Vector([1, 2, 3])
>>> vector.as_object()
[ 1 2 3 ] object

as_string(length=None)

Return vector converted to string data type.

>>> vector = di.Vector([1, 2, 3])
>>> vector.as_string()
[ "1" "2" "3" ] <U21
>>> vector.as_string(64)
[ "1" "2" "3" ] <U64

concat(*others)

Return vector with elements from others appended.

>>> a = di.Vector([1, 2, 3])
>>> b = di.Vector([4, 5, 6])
>>> c = di.Vector([7, 8, 9])
>>> a.concat(b, c)
[ 1 2 3 4 5 6 7 8 9 ] int64

drop_na()

Return vector without missing values.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector.drop_na()
[ 1 2 3 ] float64

equal(other)

Return whether vectors are equal.

Equality is tested with ==. As an exception, corresponding missing values are considered equal as well.

>>> a = di.Vector([1, 2, 3, None])
>>> b = di.Vector([1, 2, 3, None])
>>> a
[ 1 2 3 nan ] float64
>>> b
[ 1 2 3 nan ] float64
>>> a.equal(b)
True
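The rule above can be sketched in plain Python. This is only an illustration of the described semantics, not dataiter's implementation; the helper names `is_missing` and `equal` are hypothetical, and treating `None` and float NaN as the only missing values is an assumption made for brevity.

```python
import math

def is_missing(x):
    # Assumption for this sketch: None and float NaN count as missing.
    return x is None or (isinstance(x, float) and math.isnan(x))

def equal(a, b):
    """Return True if a and b match element-wise, with paired missing values counted as equal."""
    if len(a) != len(b):
        return False
    return all(x == y or (is_missing(x) and is_missing(y)) for x, y in zip(a, b))

a = [1.0, 2.0, 3.0, float("nan")]
b = [1.0, 2.0, 3.0, float("nan")]

# NaN never compares equal to itself with ==, hence the exception.
nan_equals_nan = float("nan") == float("nan")
print(nan_equals_nan, equal(a, b))
```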

classmethod fast(object, dtype=None)

Return a new vector.

Unlike __init__(), this will not convert special values in object. Use this only if you know object doesn’t contain special values or if you know they are already of the correct type.

get_memory_use()

Return memory use in bytes.

>>> vector = di.Vector(range(100))
>>> vector.get_memory_use()
800

head(n=None)

Return the first n elements.

>>> vector = di.Vector(range(100))
>>> vector.head(10)
[ 0 1 2 3 4 5 6 7 8 9 ] int64

is_boolean()

Return whether vector data type is boolean.

is_bytes()

Return whether vector data type is bytes.

is_datetime()

Return whether vector data type is datetime.

Dates are considered datetimes as well.

is_float()

Return whether vector data type is float.

is_integer()

Return whether vector data type is integer.

is_na()

Return a boolean vector indicating missing data elements.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector
[ 1 2 3 nan ] float64
>>> vector.is_na()
[ False False False True ] bool

is_number()

Return whether vector data type is number.

is_object()

Return whether vector data type is object.

is_string()

Return whether vector data type is string.

is_timedelta()

Return whether vector data type is timedelta.

property length

Return the number of elements.

>>> vector = di.Vector(range(100))
>>> vector.length
100

map(function, *args, dtype=None, **kwargs)

Apply function element-wise and return a new vector.

>>> import math
>>> vector = di.Vector(range(10))
>>> vector.map(math.pow, 2)
[ 0 1 4 9 16 25 36 49 64 81 ] float64
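As a rough plain-Python picture of the call above: the example output suggests that each element is passed to function as the first argument, followed by the extra arguments. That calling convention is inferred from the example, not confirmed from dataiter's source, and `map_elements` is a hypothetical name.

```python
import math

def map_elements(values, function, *args, **kwargs):
    # Assumed convention: the element comes first, then the extra arguments.
    return [function(x, *args, **kwargs) for x in values]

# Equivalent in spirit to vector.map(math.pow, 2) on range(10).
squares = map_elements(range(10), math.pow, 2)
print(squares)
```

math.pow always returns floats, which is consistent with the float64 result shown in the example above.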

property na_dtype

Return the corresponding data type that can handle missing data.

You might need this for upcasting when missing data is first introduced.

>>> vector = di.Vector([1, 2, 3])
>>> vector
[ 1 2 3 ] int64
>>> vector.put([2], vector.na_value)
Traceback (most recent call last):
  File "<string>", line 14, in <module>
ValueError: cannot convert float NaN to integer
>>> vector = vector.astype(vector.na_dtype)
>>> vector
[ 1 2 3 ] float64
>>> vector.put([2], vector.na_value)
>>> vector
[ 1 2 nan ] float64
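The same upcasting step can be seen with a plain NumPy array. This sketch uses item assignment instead of `put` and hard-codes `np.float64` where dataiter would supply `na_dtype`; it illustrates the underlying NumPy behavior only.

```python
import numpy as np

values = np.array([1, 2, 3], dtype=np.int64)

# An integer array has no way to store NaN, so the assignment fails.
try:
    values[2] = np.nan
    failed = False
except ValueError:
    failed = True

# Upcasting to float first makes room for NaN.
values = values.astype(np.float64)
values[2] = np.nan
print(failed, values.dtype, values)
```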

property na_value

Return the corresponding value to use to represent missing data.

Dataiter is built on top of NumPy. NumPy doesn’t support a proper missing value (“NA”), only data type specific values: np.nan, np.datetime64("NaT") and np.timedelta64("NaT"). Dataiter recommends the following values be used and internally supports them to an extent.

datetime	`np.datetime64("NaT")`
float	`np.nan`
integer	`np.nan`
string	`""`
timedelta	`np.timedelta64("NaT")`
other	`None`

Note that actually using these might require upcasting the vector. Integer will need to be upcast to float to contain np.nan. Other types, such as boolean, will need to be upcast to object to contain None.

If you need to avoid object columns, you can also consider converting booleans to float using as_float(), which will give you 0.0 for false and 1.0 for true. Depending on how you use the data, that might work as well as an object vector of True, False and None.
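The table of recommended values can be expressed as a lookup on the NumPy dtype kind. The function `na_value_for` below is a hypothetical helper written for this illustration. It is not part of dataiter, and the mapping of kind codes to table rows (for example, treating unsigned integers like integers) is an assumption.

```python
import numpy as np

def na_value_for(dtype):
    """Hypothetical helper mirroring the table of recommended missing values."""
    kind = np.dtype(dtype).kind
    if kind == "M":
        return np.datetime64("NaT")
    if kind in "fiu":
        # Integers share NaN with floats, which is why they need upcasting.
        return np.nan
    if kind == "U":
        return ""
    if kind == "m":
        return np.timedelta64("NaT")
    # Everything else, such as boolean or object, uses None.
    return None

for dtype in ["datetime64[D]", np.float64, np.int64, str, "timedelta64[s]", bool]:
    print(np.dtype(dtype), repr(na_value_for(dtype)))
```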

range()

Return the minimum and maximum values as a two-element vector.

>>> vector = di.Vector(range(100))
>>> vector.range()
[ 0 99 ] int64

rank(*, method='average')

Return the rank of each element, that is, its position in a sorted vector.

method determines how ties are resolved. ‘min’ assigns each of equal values the same rank, the minimum of the set (also called “competition ranking”). ‘max’ does the same, but assigns the maximum of the set. ‘average’ assigns the mean of ‘min’ and ‘max’. ‘ordinal’ gives each element a distinct rank, with equal values ranked by their order in the input.

Ranks begin at 1. Missing values are ranked last.

>>> vector = di.Vector([3, 1, 1, 1, 2, 2])
>>> vector.rank(method="min")
[ 6 1 1 1 4 4 ] int64
>>> vector.rank(method="max")
[ 6 3 3 3 5 5 ] int64
>>> vector.rank(method="average")
[ 6.0 2.0 2.0 2.0 4.5 4.5 ] float64
>>> vector.rank(method="ordinal")
[ 6 1 2 3 4 5 ] int64
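The four tie-breaking methods can be reproduced with a short plain-Python sketch. This is an illustration of the definitions above, not dataiter's implementation; the function name is hypothetical, and handling of missing values is left out.

```python
def rank(values, method="average"):
    """Rank values starting at 1 (sketch only, no missing value handling)."""
    # A stable sort of positions by value gives the ordinal ranking.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordinal = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ordinal[i] = position
    if method == "ordinal":
        return ordinal
    # Collect the ordinal ranks held by each distinct value.
    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(value, []).append(ordinal[i])
    if method == "min":
        return [min(groups[v]) for v in values]
    if method == "max":
        return [max(groups[v]) for v in values]
    if method == "average":
        return [(min(groups[v]) + max(groups[v])) / 2 for v in values]
    raise ValueError(f"unknown method: {method!r}")

values = [3, 1, 1, 1, 2, 2]
for method in ["min", "max", "average", "ordinal"]:
    print(method, rank(values, method))
```

Because the ordinal ranks of tied values are consecutive, the mean of their minimum and maximum equals the mean of the whole set.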

replace_na(value)

Return vector with missing values replaced with value.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector.replace_na(0)
[ 1 2 3 0 ] float64

sample(n=None)

Return n randomly chosen elements.

>>> vector = di.Vector(range(100))
>>> vector.sample(10)
[ 4 14 17 24 37 61 84 86 93 94 ] int64

sort(*, dir=1)

Return elements in sorted order.

dir is 1 for ascending sort, -1 for descending.

Missing values are sorted last, regardless of dir.

>>> vector = di.Vector([1, 2, 3, None])
>>> vector.sort(dir=1)
[ 1 2 3 nan ] float64
>>> vector.sort(dir=-1)
[ 3 2 1 nan ] float64
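One way to get "missing values last, regardless of dir" is to sort only the present values and append the missing ones afterwards. The sketch below does this for a list of floats. It is an assumption about how such behavior can be achieved, not a description of dataiter's code, and `sort_na_last` is a hypothetical name.

```python
import math

def sort_na_last(values, dir=1):
    """Sort floats with NaN kept at the end for either direction."""
    # The parameter name dir mirrors the documented keyword.
    present = [x for x in values if not math.isnan(x)]
    missing = [x for x in values if math.isnan(x)]
    present.sort(reverse=(dir == -1))
    return present + missing

values = [2.0, math.nan, 3.0, 1.0]
ascending = sort_na_last(values, dir=1)
descending = sort_na_last(values, dir=-1)
print(ascending, descending)
```

Simply reversing an ascending sort would instead move the missing values to the front.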

tail(n=None)

Return the last n elements.

>>> vector = di.Vector(range(100))
>>> vector.tail(10)
[ 90 91 92 93 94 95 96 97 98 99 ] int64

to_string(*, max_elements=None)

Return vector as a string formatted for display.

>>> vector = di.Vector([1/2, 1/3, 1/4])
>>> vector.to_string()
[ 0.500000 0.333333 0.250000 ] float64

to_strings(*, ksep=None, quote=True, pad=False, truncate_width=inf)

Return vector as strings formatted for display.

>>> vector = di.Vector([1/2, 1/3, 1/4])
>>> vector.to_strings()
[ "0.500000" "0.333333" "0.250000" ] <U8

tolist()

Return vector as a list with elements of matching Python builtin type.

Missing values are replaced with None.
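For comparison, a plain NumPy float array converts to Python builtins with `ndarray.tolist()` but keeps NaN as is. The sketch below adds the replacement with None by hand to show the difference; it covers only the float case and is not dataiter's implementation.

```python
import numpy as np

array = np.array([1.0, 2.0, np.nan])

# ndarray.tolist() yields Python floats, but NaN stays NaN.
plain = array.tolist()

# Replace NaN with None; x != x is true only for NaN.
cleaned = [None if x != x else x for x in plain]
print(plain, cleaned)
```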

unique()

Return unique elements.

>>> vector = di.Vector([1, 1, 1, 2, 2, 3])
>>> vector.unique()
[ 1 2 3 ] int64