|
Judy Glossary
The follow list of terms was originally written in May 2000, based on Judy
III and early Judy IV, without reference to outside academic source.
There is no guarantee that our definitions agree with consensus (academic)
usage...
Abstract data type (ADT): Any data structure that can hold a
variety of specific "key" datatypes, such as integers or characters. It's
the relationship between the keys, not the type of the keys, that defines
the ADT. We also use ADT to refer to cases in which a data structure
hasvariable parts (nodes) to adapt to the data being stored.
Adjacent (Neighbor): In a data structure (ADT) such as
Judy1/JudyL, Indexes (or keys) are stored in numerically sorted order. The
numerically closest (less and/or greater) to the one specified is the
adjacent or neighbor key. In JudySL, Indexes (null terminated string keys)
are stored in lexicography (dictionary) sorted order. The likelihood of a
neighbor being stored physically close in memory is very high. During
neighbor searches, access times are usually very fast, because the path
leading to them is typically in cache memory (from the previous access).
ADT: See Abstract Data Type.
Amortization: In a tree hierarchy, especially one with a high
fan-out (256 in Judy1/L), it is possible to waste memory to obtain better
performance in the higher levels of the tree. The wasted memory is
"amortized" over the high population in the lower levels. This memory is
normally insignificant relative to the whole tree structure.
API: Application Programming Interface, a specification of
the signatures of a collection of useful methods or functions and possibly
global variables.
Bitmap branch [node]: A Bitmap Branch is one of the three (3)
types of branches (nodes) used in Judy1 /L. A bitmap branch is a compressed
(memory efficient) structure of a node that has 256 virtual pointers (may or
may not be present). A bitmap of 256 bits is used to mark the expanses that
have a population and thus a pointer to lower levels of the tree. In the
design of Judy, there is only one (1) of these nodes in any path down the
tree to a leaf node and it is the node nearest the leaf. When the
population of Indexes below a bitmap branch reaches a point where the memory
used can be effectively amortized, it is converted to an "Uncompressed
Branch".
Bitmap leaf [node]: When the expanse of a leaf node is 256, a
bitmap leaf is a very fast structure that does not require searching. It
needs 32 bytes (256 bits) of memory to support up to 256 Indexes. It is
used when the population of that expanse exceeds about 24 Indexes.
Branch [node]: A non-leaf node in a Judy tree other than a
JPM node. This term is used as a noun, although this is a bit weird at first
(but it is used this way in academic papers too), in order to give a more
specific name to this type of node. Note that Judy branches contain more
than just pointers (addresses); they contain JPs (Judy Pointers), which are a
type of "rich pointer" that associates information with the address.
Branch width (fanout): The number of possible (virtual) or
actual pointers in a branch (the number of subtrees) which is related to,
but not identical to, the branch's memory size in words or bytes. A branch
node divides an expanse into a collection of subexpanses, each with one
associated pointer, some of which can be null (if the subexpanse is
unpopulated). A linear orbitmap branch has an actual fanout (width) lower
than its virtual width, whereas a Judy uncompressed branch has virtual fan
out = actual fanout = 256.
By-expanse: Dividing an index set into subsets, as in a
hierarchical data structure, by partitioning the indexes according to a
prearranged series of subexpanses, also referred to in some literature as "a
priori" storage. A digital tree is a by-expanse tree. In practical terms
by-expanse storage means subdividing evenly, and with the same fanout at
each level, in fact using whole numbers of bits (2^N fanout at each branch
node). Judy divides index sets by-expanse with a very wide digit sizeof 8
bits (256 fanout). By-expanse trees are by nature unbalanced, but can also
have predictable depth for fixed-size keys.
By-population: Dividing an index set into subsets, as in a
hierarchical data structure, by partitioning the indexes according to which
indexes are present. A binary storage tree is a by-population tree. The
partitioning rules are prearranged as is true for by-expanse storage, but
tend to be more about tree balancing than about locality of modifications,
meaning the partitioning ofany newly inserted indexes depends more on the
index set already stored than about the value of the new index itself.
Cache fill: See CPU Cache Fill.
Cache line: See CPU Cache Line.
Cascade: During insertion of an index, this is when a leaf
data structure overflows and more nodes must be added to a tree (or in some
cases the leaf's indexes can be compressed into a lower-level leaf under a
narrow pointer). A decascade (coalesce) is when the opposite occurs during
an index deletion. Cascading results from changing an index set into one of
its adjacents -- in the Judy context, by inserting or deleting one index.
Compressed branch: A linear or bitmap branch.
Compressed leaf: A bitmap leaf; also used to refer to index
compression in lower-level leaves, where in the already-decoded (at higher
levels) leading digits are not present.
Counting tree: A tree where each branch node records the
number of indexes stored under it, or equivalently, as in Judy1 and JudyL,
where each pointer in the branch node has associated a count of the indexes
stored under or in the branch or leaf pointed at by the pointer. These
counts can be used to rapidly determine the number of valid indexes in any
range or expanse, that is, between any pair of indexes. They are also useful
internally when inserting, deleting, or searching for previous or next
indexes. Alternatively, each node or pointer could have associated a count
which is a total of the application's values associated with each index in
its index set. This can be useful in certain applications for determining
index value density instead of index density. Note: Wider digital branches,
even if otherwise efficient (meaning few null pointers), cause more work,
and possibly more cache fills, while gathering counts, since every pointer
in the wide branch must be visited.
Cache fill (cache line fill): Needing to go to main memory
(RAM), as the result of a cache miss, for one or more bytes not currently in
any cache line; results in filling a cacheline; takes 30-150 times longer
than a cache hit. Note that sometimes a cache fill results in a disk
(buffer) cache miss and fill (swap in), by reading from mass storage, which
is even more expensive. In this document, "cache" refers to CPU cache lines
and does not even consider diskcache.
Cache line: A block of memory local to a CPU chip that caches
the same-sized block from the computer's main memory. This is analogous to
using a buffer cache to minimize disk I/O, but is implemented in hardware.
Cache lines are size-aligned and typically 16 words in size in
modern processors. Often there are hidden constraints on them, such as they
can only map to certain offsets in memory.
Decascade: See Cascade.
Decoded index bits/bytes/digits: The most-significant
bits/bytes/digits of an index that were already used, while traversing the
tree, to do address calculations in higher-level branch nodes.
Density: A population divided by its expanse.
Digit: A key, or a field, or portion of a key, that is used
to address through a branch node in a digital tree. For Judy each digit is
one byte (eight bits), "very wide" as digital trees go.
Digital tree = trie: A tree data structure in which each node
divides its expanse uniformly into N same-sized subexpanses, each with its
own pointer to a next-lower node. In a pure digital tree the indexes need
not be stored in the tree. They are used (piecemeal, digit-by-digit) to
address into each node's array of pointers. Hence N is typically a power of
2, so integer numbers of bits are "decoded" from the index while traversing
down the tree. An ordinary array can be thought of as a one-level trie where
the entire index is used in a single addressing calculation. Note that N
need not be the same for every node in a trie, if there's a way to determine
the appropriate N for each node. For Judy, N = 256 (8 bits) at every level,
but compressed branches have an actual fanout that is less than their
virtual fanout.
Dynamic range: The population (size) or population-type
(distribution of keys or indexes) over which an algorithm or data structure
works well (untuned or with tuning). Judy has a wide dynamic range. Most
ADTs work well only overspecific, "tuned" or designed-in (limited) dynamic
ranges.
Expanse: The range of possible indexes (keys) that can be
stored under one pointer (root or lower) in a tree structure. For example, a
root pointer to a 32-bit Judy array has an expanse of 0 .. 2^32 - 1.
Fanout: See Branch Width.
Fixed-size index: A key to an array-like data structure which
is always the same size, in practice one machine word in size. Judy1 and
JudyL support fixed-size, one-word indexes only. Judy grew out of a need to
solve this problem efficiently, although in the academic literature only
variable-size index problems seem to be considered interesting. Ironically a
meta-trie consisting of JudyL arrays as very-wide branch nodes can be used
to support variable-size keys quite nicely.
Grow-in-place: Allocating memory in chunks that are often
larger than the actual bytes required, by rounding up to a lesser number of
different memory chunk sizes, means there is often somepadding present in a
given node. Thus the node can be modified in place during an insert
operation, and it can also shrink in place during a deletion without having
to be reallocated to a smaller size.
Horizontal compression: See Index Compression.
Hybrid data structures: Using different ADTs at different
levels or for different nodes in a tree data structure. This means switching
to a different ADT while traversing the tree as the expanseand/or population
shrinks, to best represent the population's indexes with the fewest cache
fills and the least amount of memory.
Hysteresis: Leaving a system (in this case a Judy data
structure) in a previous state even when it could be modified to a new
state, with the result that the present state depends on the "direction"
(such as insertion or deletion) from which it was approached. Any
application that deletes indexes is likely to perform a series of insertions
and deletions, often within a relatively small expanse ofindexes. Hence it
might be faster on average to not revise the data structure to the optimal
form for each new index set (after each operation). Especially upon deletion
followed by insertion, it can result in superior performance if decascades
are postponed. The risk with hysteresis is that random or pathological
sequences of insertions and deletions can result in a greatly suboptimal
data structure -- and it's difficult to identify them or rule them out. Judy
presently makes use of limited hysteresis (one index at most) in some cases
of deletion.
Immediate indexes: Similar to how a machine instruction can
contain an immediate (constant) rather than indirect (variable) value, if
the population in a Judy expanse is low enough (perhaps just one index),
depending on the data structure (memory space associated with each pointer
in a branch), the population's indexes can be stored locally in a JP rather
than in a separate leaf node under the JP, saving one indirection and
possibly some memory. This requires encoding into the JP the special case of
immediate indexes.
Index: A key to an array or to an array-like API.
Index compression = horizontal compression: At any level in a
tree data structure, all indexes in the current population (index set within
the current expanse) might have one or more most-significant bits (MSB) in
common. These common bits can be stored just once in multi-index shortcut
leaves, in a variety of possible "compression schemes", such as: q Storing
the first valid index in full and the others as relative offsets, either
from the first, or from each previous. q Storing the common bits only once,
and for each valid index, only the bits that might differ. This is the same
as relative offsets, all from the first value, which is a base value not
necessarily equal to any one of the indexes. In practice, due to machine
instruction efficiencies, common MSBs are compressed only in units of whole
bytes. For example, in a 32 bit system, there might be 0, 1, 2, or 3 common
bytes, and thus 1, 2, 3, or 4 varying bytes, in the indexes in a leaf. At a
low enough level in the tree, say where 3 bytes were already decoded, only 1
byte can vary. Furthermore, Judy does not bother with the kinds of complex
compression schemes described above because we think the extra CPU time
involved would out weigh the space savings. Judy does have the ability to
compress a collection of indexes to a lower-level leaf, which has a greater
capacity, under a narrow pointer, which skips 1 or more common leading
bytes, if all of the indexes in the expanse are in the narrower leaf's
expanse.
Index set: Given a data structure capable of holding 0..N
indexes (keys, elements), any one [sub]set of those 0..N indexes is an index
set. For a given expanse, each possible population's set of indexes is one
index set. It must be possible to store and distinguish every unique index
set inthe data structure, regardless of the sequence of insert/delete
operations used to arrive at that indexset, or else the data structure is
not very useful. Of course, Judy arrays pass this test.
Index: A key value to an ADT that appears array-like at the
application interface. Since Judy appears array-like, keys into Judy ADTs are
referred to as indexes.
Informational pointer: See JP.
Iteration: A procedure where a series of operations is
repeated.
JP: Judy Pointer, a two-word object that populates branch
nodes, also referred to... as a "rich pointer" or "informational pointer". A
JP contains an address field, except when the JP is null (represents an
unpopulated subexpanse) or contains immediate indexes, plus other
descriptivefields such as Decode bytes, Population count, and
Type.
JPM: Judy population/memory node, at most one per tree (Judy
array), used when the array/tree population is too large to fit in a
root-level leaf and hence is large enough to amortize the memory needed for
the JPM. A root pointer points to a JPM which contains a top-level JP which
in turn points to a top-level branch node. The JPM contains additional
fields that make tree management faster and easier.
Judy array: An ADT that acts much like an ordinary array, but
which appears unbounded (exceptby the expanse of the index) and is allocated
by the first store/insert into the array. Judy arrays are hybrid digital
trees consisting of a variety of branch and leaf node ADTs. Judy indexes can
be inserted, retrieved, deleted, and searched in sorted order.
Judy pointer: See JP.
Judy population/memory node: See JPM.
Judy tree: The internal data structure used to implement the
data stored in what is presented externally as a Judy array.
Judy1: Bit array that maps a long word index to Boolean
(true/false).
JudyL: Word array that maps a long word index to a long word
value.
JudySL: Word array with string index; map string index to
long word value. Built as a meta-trie using JudyL arrays as extremely wide
branch nodes. (Includes the use of JLAP_INVALID in JudyL value areas
(subsidiary root pointers) to "escape" to shortcut leaves for unique string
suffixes, a critical though subtle feature.)
Key: A data value (bitstring) used to look up related data
values in a data structure that holds a collection of keys. See also Index.
Level compression = vertical compression: In the academic
sense this means skipping levels in the tree when a branch node would
otherwise have a fanout of 1. Read about Patricia tries for an example. Judy
does this using narrow pointers.
Linear branch [node]: A Judy non-leaf branch node for a
low-fanout population. Linear branches are constrained to one cache line =
16 words, hence hold 1..7 JPs out of a virtual fanoutof 256. Populated
subexpanses are listed by-population (1 byte/digit each), along with a
corresponding list of JPs.
Linear leaf [node]: A multi-index leaf (a terminal node) that
holds a population too large to fit in an immediate index JP but small
enough to fit in 2 cache lines (for JudyL, 4 cache lines including value
areas). By convention a root-level leaf is referred to separately, and a
linear leaf is always below the root level. Note that a linear leaf is never
large enough to be fully populated, that is, tohold 256 indexes.
Locality: Designing a data structure so a single insertion or
deletion has the least wide-ranging effects on average and/or in the worst
case. Ideally, every index set's optimal data structure is very similar to
every adjacent index set's optimal data structure, implying excellent
average and worst-case locality. If this cannot be achieved, at least
instances of poor locality should be relatively rare.
LSB: Least Significant Bytes/Bits.
Memory manager: Any dynamic multi-node data structure like
Judy requires an underlying memory manager from which it can obtain chunks
of memory, and to which it can possibly free(return) them. To minimize
wasted memory and reduce the frequency and cost of pathological cases due to
memory fragmenting into small and non-reusable chunks, the data structure
should use the minimum number of different sizes of memory chunks, maximize
the frequency of reuse of freed memory chunks, and/or the memory manager
itself should be smart and efficient about merging adjacent free chunks into
larger ones for reuse. (The malloc(3C) library is an example of a
memory manager.)
Meta-trie: A hybrid digital tree (trie) whose branch nodes
are in turn smaller tries. For example, a variable-size key trie might use
fixed-size key tries like JudyL arrays as its branch nodes, especially given
an "escape" mechanism like JLAP_INVALID to mark a JudyL value area as being
other than a root pointer to a subsidiary JudyL array, such as a shortcut or
multi-index leaf node.
MSB: Most Significant Bytes/Bits.
Multi-index leaf: If the population of an expanse in a trie
is small enough, it can be more efficient to store the population's indexes
in a multi-index leaf object rather than under additional trie nodes leading
to single-index leaves.
Narrow pointer: Suppose that high in a digital tree data
structure there's a large expanse, with apopulation large enough to not fit
in a single leaf node, but whose members are clustered closely enough to have
lots of index compressibility due to common leading bits/bytes/digits. A
narrow pointer is a way of simultaneously skipping levels and unoccupied
expanses in the tree. Associated with the narrow pointer must be the common
index bits being "decoded" via the pointer, unless the whole indexes are
stored in the leaves. Judy1 and JudyL support only fixed-size indexes (1
word each), so Decode bytes in the JPs in the branches describe the
digits skipped by narrow pointers.
Neighbor: In a sorted index set, the neighbor of any index
(element) is the previous or next valid index in the index set. One of Judy's
strengths is its ability to rapidly locate neighbors.
Opaqueness: The property of hiding the internal details of a
data structure or algorithm. Because Judy is very opaque, the application
sees it purely as an array. Users don't need to know the array is implemented
as a tree, or actually as an even more complex hybrid ADT.
Optimal data structure: Given a set of unambiguous criteria,
if they exist, by which to rank different data structures capable of holding
the same index set, there should be exactly one best data structure to
represent each possible index set. Ideally the implementation of each
adjacency operation would then convert an optimal data structure (for one
index set) into another optimal data structure (for its adjacent index set).
It's possible to have no unambiguous criteria for optimizing data structures.
For example, Judy seeks to simultaneously: q Minimize index access (get)
time, both average and worst-case. q Minimize index insert/delete time,
both average and worst-case. q Minimize memory consumed and thus maximize
efficiency. q Be as simple as possible -- but we had to give this up in the
transition from Judy III to Judy IV. We learned that a wide dynamic range
plus high-performance (although still portable) software requires a lot of
different data structures (tree nodes) to pick from, plus maximally in-line
code with minimal functions, loops, etc. The optimal data structure for any
index set depends sensitively on the tradeoffs chosen between conflicting
criteria, and might not even be "computable". (It worries me that whenever
this ambiguity or complexity exists we might overlook better data structures
or algorithms.) Ironically, space and time need not always be in conflict.
Using data compression techniques to save memory, at the cost of some CPU
time to compress/decompress, can reduce the overall size of a data structure,
hence average CPU cache fills, resulting in less overall time to access an
index.
Outlier: An index that falls within a given expanse but not
within a narrow subexpanse of that expanse. For example, given indexes all
beginning with 0x0102... in a leaf under a narrow pointer, an index beginning
with 0x0103... would be an outlier under slot (digit) 0x01 of the top
level branch.
Parent [node]: A branch that contains a pointer to some other
node (a child node).
Pathology: A pathological index set is one that brings out
the worst-case behavior in a given algorithm and/or data structure. A
pathological case is a sequence of insertions and deletions (or other
adjacency operations) that result in a pathological index set and/or a
suboptimal data structure, and/or which takes much longer to perform than
the average.
Population: The number of indexes that are stored in one
expanse, or the indexes them selves (that is, the index set), depending on
the context.
Positional lookup: Data, such as pointers in a trie node,
which are located via address calculations rather than by linear, binary, or
any other type of search. Note that a positional lookup results in at most
one cache fill -- so long as each data element itself does not cross cache
line boundaries -- no matter how large the array or node in which the
positional lookup is done.
Recursion: When a function calls itself.
Rich pointer: See JP.
Root pointer: A pointer to the top node of a tree structure.
(Tree data structures are typically drawn upside down, with the root at the
top and the leaves at the bottom.)
Root-level leaf: A simple linear list of whole indexes,
possibly (in JudyL) with associated value areas, used for low-population
Judy arrays (less than or equal to 31 indexes = 2 cache lines). The root
pointer indicates the presence and type of the root-level leaf. A generic
root-level leaf starts with a population count word. Small root-level leaves
have their populations encoded in the root pointer to save memory.
Shortcut leaf: By default, a full trie for a fixed-size index
contains a fixed number of levels. If below a certain point there is only one
index stored, that is, an expanse has a population of 1, it can be
considerably more efficient to store that index in a non-trie leaf object
directly below the last node containing 2 or more indexes in its expanse. A
shortcut leaf must contain all of the remaining undecoded bits of the index
it contains, since they are not encoded positionally in additional trie
nodes. Shortcut leaves are mainly useful for variable-size key trees such as
JudySL arrays, although multi-index leaves can be seen as a type of shortcut
leaf too.
Sibling branch [node]: A branch node with the same parent as
another branch; a peer.
Sorted index set: An index set whose members are uniquely
sorted according to some rule, suchas numerically or lexicographically.
Tree data structure: Most definitions of Binary and
Digital/Tries tree data structures miss the fundamental difference. A
typical Binary tree has nodes (or forks) that cut up the "population" in an
attempt to keep the depth of the tree to a minimum. This requires a
comparison test be done at every node to determine which branch (fork) to
follow. A perfectly balanced binary tree has a very predictable depth
proportional to (log(base 2) of) the population. In contrast, a
Digital/Trie tree the population is cut up by "expanse". That is a portion
of the Index/key is used (8 bits with Judy1/L) is used to determine which of
the 256 nodes in the branch to follow. The maximum depth of a Judy1/L tree
is (in 32 bit version) 32/8=4. In the past, digital trees have not been
popular because of the astronomical amount of memory it takes to implement
them. Judy solves that problem very well. Judy's data structures are
tailored to support the population below an expanse (node) at any given
time -- whether it be 256, 65536, or any possible 256^n expanse.
Trie: See Digital Tree.
Unbounded array: An array maps an index to a value. Normally
an array is defined (at compiletime) or allocated (at run time) with some
fixed size, such as abc[10], which in C means indexes 0..9 are valid. An
unbounded array is one where the size of the array is only limited by the
size of the index. On a 32-bit system, for Judy1 and JudyL, where the index
is a native machine word,this means 32 bits, or indexes 0..2^32-1. For
JudySL, this means any string you can fit in memory. So unbounded doesn't
mean infinite, it means not arbitrarily bounded at less than the
machine's limits.
Uncompressed branch node: A simple array of 256 JPs,
including null JPs for unpopulated subexpanses.
Valid/invalid [index]: A valid index is one that appears in a
index set; that is, it's set or stored in a Judy array. An invalid index is
one that could appear in the index set but does not; that is, it's not
currently stored in a Judy array. There are many synonyms for each word,
such as: present/absent,set/unset, stored/free, full/empty.
Variable-size index: A key to an array-like data structure
that handles a variety of index sizes, or possibly unbounded sizes.
Bitstrings and character strings are examples of variable-size keys.JudySL
is an array-like API implemented as a meta-trie that supports character
string indexes.
Vertical compression: See Level Compression and Digital Tree.
Word: A unit of memory that is the same size as a pointer to
pointer to void, and/or an unsigned long, in native C; typically 32 or 64
bits today. PA-RISC and IPF hardware support both 32-bit and 64-bit programs
depending on compilation options.
|