dataiter.Vector
- class dataiter.Vector(object, dtype=None)
A one-dimensional array.
Vector is a subclass of NumPy ndarray. Note that not all ndarray methods have been overridden, so careless use of base class in-place methods can twist the data into a multi-dimensional or other non-vector form, causing unexpected results. See https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
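The caveat about base class in-place methods can be seen with plain NumPy. The sketch below uses a hypothetical ndarray subclass, Demo, as a stand-in for a one-dimensional subclass (it is not dataiter code) and assumes NumPy is installed. An inherited in-place method changes the dimensionality without the subclass being involved.

```python
import numpy as np


class Demo(np.ndarray):
    """Hypothetical stand-in for a one-dimensional ndarray subclass."""


# Allocate an array that owns its data so that it can be resized in place.
demo = Demo((6,), dtype=np.int64)
demo[:] = range(6)
assert demo.ndim == 1

# ndarray.resize is an inherited in-place method that the subclass
# does not override, so nothing keeps the data one-dimensional.
demo.resize((2, 3), refcheck=False)
print(type(demo).__name__, demo.shape)
```

After the call the object is still an instance of the subclass, but it is no longer one-dimensional, which is the kind of state the note warns about.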
- __init__(object, dtype=None)
Return a new vector.
object can be any one-dimensional sequence, such as a NumPy array, Python list or tuple. Creating a vector from a NumPy array is fast; creating one from other types is slower, because data types and special values need to be converted.
dtype is the NumPy-compatible data type for the vector. Providing dtype makes creating the vector faster; otherwise the appropriate data type is guessed by introspecting the elements of object, which is potentially slow, especially for large objects.
>>> di.Vector([1, 2, 3], int)
[ 1 2 3 ] int64
- as_boolean()
Return vector converted to boolean data type.
>>> vector = di.Vector([0, 1])
>>> vector.as_boolean()
[ False True ] bool
- as_bytes()
Return vector converted to bytes data type.
>>> vector = di.Vector(["a", "b"])
>>> vector.as_bytes()
[ b'a' b'b' ] |S1
- as_date()
Return vector converted to date data type.
>>> vector = di.Vector(["2020-01-01"])
>>> vector.as_date()
[ 2020-01-01 ] datetime64[D]
- as_datetime(precision='us')
Return vector converted to datetime data type.
>>> vector = di.Vector(["2020-01-01T12:00:00"])
>>> vector.as_datetime()
[ 2020-01-01T12:00:00.000000 ] datetime64[us]
- as_float()
Return vector converted to float data type.
>>> vector = di.Vector([1, 2, 3])
>>> vector.as_float()
[ 1 2 3 ] float64
- as_integer()
Return vector converted to integer data type.
>>> vector = di.Vector([1.0, 2.0, 3.0])
>>> vector.as_integer()
[ 1 2 3 ] int64
- as_object()
Return vector converted to object data type.
>>> vector = di.Vector([1, 2, 3])
>>> vector.as_object()
[ 1 2 3 ] object
- as_string(length=None)
Return vector converted to string data type.
>>> vector = di.Vector([1, 2, 3])
>>> vector.as_string()
[ "1" "2" "3" ] <U21
>>> vector.as_string(64)
[ "1" "2" "3" ] <U64
- concat(*others)
Return vector with elements from others appended.
>>> a = di.Vector([1, 2, 3])
>>> b = di.Vector([4, 5, 6])
>>> c = di.Vector([7, 8, 9])
>>> a.concat(b, c)
[ 1 2 3 4 5 6 7 8 9 ] int64
- drop_na()
Return vector without missing values.
>>> vector = di.Vector([1, 2, 3, None])
>>> vector.drop_na()
[ 1 2 3 ] float64
- equal(other)
Return whether vectors are equal.
Equality is tested with ==. As an exception, corresponding missing values are considered equal as well.
>>> a = di.Vector([1, 2, 3, None])
>>> b = di.Vector([1, 2, 3, None])
>>> a
[ 1 2 3 nan ] float64
>>> b
[ 1 2 3 nan ] float64
>>> a.equal(b)
True
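The exception for missing values matters because NaN never compares equal to itself under plain ==. The following is a minimal pure-Python sketch of that comparison rule, not dataiter's implementation; is_missing and equal_with_na are hypothetical helper names, and only None and float NaN are treated as missing here.

```python
import math


def is_missing(value):
    """Return True for None and float NaN (illustrative NA check)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def equal_with_na(a, b):
    """Compare two sequences with ==, treating paired missing values as equal."""
    if len(a) != len(b):
        return False
    return all(
        (is_missing(x) and is_missing(y)) or x == y
        for x, y in zip(a, b)
    )


nan = float("nan")
print(nan == nan)  # plain == reports NaN as unequal to itself
print(equal_with_na([1.0, 2.0, 3.0, nan], [1.0, 2.0, 3.0, float("nan")]))
```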
- classmethod fast(object, dtype=None)
Return a new vector.
Unlike __init__(), this will not convert special values in object. Use this only if you know object doesn’t contain special values or if you know they are already of the correct type.
- get_memory_use()
Return memory use in bytes.
>>> vector = di.Vector(range(100))
>>> vector.get_memory_use()
800
- head(n=None)
Return the first n elements.
>>> vector = di.Vector(range(100))
>>> vector.head(10)
[ 0 1 2 3 4 5 6 7 8 9 ] int64
- is_datetime()
Return whether vector data type is datetime.
Dates are considered datetimes as well.
- is_na()
Return a boolean vector indicating missing data elements.
>>> vector = di.Vector([1, 2, 3, None])
>>> vector
[ 1 2 3 nan ] float64
>>> vector.is_na()
[ False False False True ] bool
- property length
Return the number of elements.
>>> vector = di.Vector(range(100))
>>> vector.length
100
- map(function, *args, dtype=None, **kwargs)
Apply function element-wise and return a new vector.
>>> import math
>>> vector = di.Vector(range(10))
>>> vector.map(math.pow, 2)
[ 0 1 4 9 16 25 36 49 64 81 ] float64
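The documented output (squares, not powers of two) suggests that each element is passed as the first argument, followed by the extra positional and keyword arguments. Under that assumption, element-wise mapping can be sketched in pure Python as below; map_elements is a hypothetical name and the sketch ignores the dtype parameter.

```python
import math


def map_elements(values, function, *args, **kwargs):
    """Call function(element, *args, **kwargs) for each element."""
    return [function(value, *args, **kwargs) for value in values]


# math.pow(element, 2) squares each element and returns floats.
squares = map_elements(range(10), math.pow, 2)
print(squares)
```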
- property na_dtype
Return the corresponding data type that can handle missing data.
You might need this for upcasting when missing data is first introduced.
>>> vector = di.Vector([1, 2, 3])
>>> vector
[ 1 2 3 ] int64
>>> vector.put([2], vector.na_value)
Traceback (most recent call last):
  File "<string>", line 14, in <module>
ValueError: cannot convert float NaN to integer
>>> vector = vector.astype(vector.na_dtype)
>>> vector
[ 1 2 3 ] float64
>>> vector.put([2], vector.na_value)
None
>>> vector
[ 1 2 nan ] float64
- property na_value
Return the corresponding value to use to represent missing data.
Dataiter is built on top of NumPy. NumPy doesn’t support a proper missing value (“NA”), only data type specific values: np.nan, np.datetime64("NaT") and np.timedelta64("NaT"). Dataiter recommends the following values be used and internally supports them to an extent.
datetime     np.datetime64("NaT")
float        np.nan
integer      np.nan
string       ""
timedelta    np.timedelta64("NaT")
other        None
Note that actually using these might require upcasting the vector. Integer will need to be upcast to float to contain np.nan. Other, such as boolean, will need to be upcast to object to contain None.
If you need to avoid object columns, you can also consider converting booleans to float using as_float(), which will give you 0.0 for false and 1.0 for true. Depending on how you use the data, that might work as well as an object vector of True, False and None.
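The mapping from data type to missing value, and the upcasting it can require, can be sketched with plain NumPy. This is an illustration under stated assumptions, not dataiter's code: recommended_na is a hypothetical helper, NumPy is assumed to be installed, and treating unsigned integers like signed ones is an assumption of this sketch.

```python
import numpy as np


def recommended_na(dtype):
    """Return the missing value recommended above for a NumPy dtype."""
    kind = np.dtype(dtype).kind
    if kind == "M":  # datetime
        return np.datetime64("NaT")
    if kind == "m":  # timedelta
        return np.timedelta64("NaT")
    if kind in "fiu":  # float, signed and (assumed) unsigned integer
        return np.nan
    if kind == "U":  # string
        return ""
    return None  # everything else, such as boolean or object


integers = np.array([1, 2, 3])
# Integers cannot hold np.nan, so upcast to float before assigning it.
floats = integers.astype(float)
floats[2] = recommended_na(integers.dtype)
print(floats)

# Booleans converted to float leave room for np.nan without object dtype.
flags = np.array([True, False]).astype(float)
print(flags)
```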
- range()
Return the minimum and maximum values as a two-element vector.
>>> vector = di.Vector(range(100))
>>> vector.range()
[ 0 99 ] int64
- rank(*, method='average')
Return the order of elements in a sorted vector.
method determines how ties are resolved. ‘min’ assigns each of equal values the same rank, the minimum of the set (also called “competition ranking”). ‘max’ is the same, but assigning the maximum of the set. ‘average’ is the mean of ‘min’ and ‘max’. ‘ordinal’ gives each element a distinct rank with equal values ranked by their order in input.
Ranks begin at 1. Missing values are ranked last.
>>> vector = di.Vector([3, 1, 1, 1, 2, 2])
>>> vector.rank(method="min")
[ 6 1 1 1 4 4 ] int64
>>> vector.rank(method="max")
[ 6 3 3 3 5 5 ] int64
>>> vector.rank(method="average")
[ 6.0 2.0 2.0 2.0 4.5 4.5 ] float64
>>> vector.rank(method="ordinal")
[ 6 1 2 3 4 5 ] int64
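The four tie-breaking methods can be made concrete with a small pure-Python sketch. It is not dataiter's implementation: rank_values is a hypothetical function operating on a plain list, and it does not handle missing values.

```python
def rank_values(values, method="average"):
    """Rank values starting at 1, resolving ties according to method."""
    # A stable sort keeps equal values in input order, which is
    # exactly the 'ordinal' ranking.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordinal = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ordinal[index] = position
    if method == "ordinal":
        return ordinal
    # Collect the ordinal ranks occupied by each distinct value.
    groups = {}
    for index, value in enumerate(values):
        groups.setdefault(value, []).append(ordinal[index])
    if method == "min":
        return [min(groups[value]) for value in values]
    if method == "max":
        return [max(groups[value]) for value in values]
    if method == "average":
        return [(min(groups[v]) + max(groups[v])) / 2 for v in values]
    raise ValueError(f"unknown method: {method!r}")


data = [3, 1, 1, 1, 2, 2]
for method in ("min", "max", "average", "ordinal"):
    print(method, rank_values(data, method=method))
```

For the input used in the documented example, this sketch yields the same ranks for each method.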
- replace_na(value)
Return vector with missing values replaced with value.
>>> vector = di.Vector([1, 2, 3, None])
>>> vector.replace_na(0)
[ 1 2 3 0 ] float64
- sample(n=None)
Return randomly chosen n elements.
>>> vector = di.Vector(range(100))
>>> vector.sample(10)
[ 4 14 17 24 37 61 84 86 93 94 ] int64
- sort(*, dir=1)
Return elements in sorted order.
dir is 1 for ascending sort, -1 for descending.
Missing values are sorted last, regardless of dir.
>>> vector = di.Vector([1, 2, 3, None])
>>> vector.sort(dir=1)
[ 1 2 3 nan ] float64
>>> vector.sort(dir=-1)
[ 3 2 1 nan ] float64
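Keeping missing values last in both directions means they cannot simply take part in a reversed sort. One way to get that behavior, shown here as a pure-Python sketch rather than dataiter's implementation, is to sort the present values and append the missing ones afterwards; sort_na_last is a hypothetical name and None stands in for the missing value.

```python
def sort_na_last(values, dir=1):
    """Sort present values by dir and append missing values (None) last."""
    present = [value for value in values if value is not None]
    missing = [value for value in values if value is None]
    # dir=1 sorts ascending, dir=-1 descending; missing values are
    # appended after sorting so the direction does not affect them.
    return sorted(present, reverse=dir < 0) + missing


print(sort_na_last([2, None, 3, 1], dir=1))
print(sort_na_last([2, None, 3, 1], dir=-1))
```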
- tail(n=None)
Return the last n elements.
>>> vector = di.Vector(range(100))
>>> vector.tail(10)
[ 90 91 92 93 94 95 96 97 98 99 ] int64
- to_string(*, max_elements=None)
Return vector as a string formatted for display.
>>> vector = di.Vector([1/2, 1/3, 1/4])
>>> vector.to_string()
[ 0.500000 0.333333 0.250000 ] float64
- to_strings(*, ksep=None, quote=True, pad=False, truncate_width=inf)
Return vector as strings formatted for display.
>>> vector = di.Vector([1/2, 1/3, 1/4])
>>> vector.to_strings()
[ "0.500000" "0.333333" "0.250000" ] <U8