Data structure for an interval map


Let n be an integer, and let ℤ denote the set of all integers. Let [a,b] denote the interval of integers {a, a+1, a+2, …, b}.

I am looking for a data structure to represent a map f : [1,n] → ℤ. I want the data structure to support the following operations:

  • get(i) should return f(i).

  • set([a,b], y) should update f so that f(a) = f(a+1) = ⋯ = f(b) = y, i.e., update f to a new map f′ such that f′(i) = y for i ∈ [a,b] and f′(i) = f(i) for i ∉ [a,b].

  • stab(i) should return the largest interval [a,b] such that i ∈ [a,b] and f is constant on [a,b] (i.e., f(a) = f(a+1) = ⋯ = f(b)).

  • add([a,b], δ) should update f to a new map f′ such that f′(i) = f(i) + δ for i ∈ [a,b] and f′(i) = f(i) for i ∉ [a,b].

I want each of these operations to be efficient. I would count O(1) or O(lg n) time as efficient, but O(n) time is too slow. It's OK if the running times are amortized running times. Is there a data structure that simultaneously makes all of these operations efficient?

(I've noticed a similar pattern come up in several programming challenges. This is a generalization that would suffice for all of those challenge problems.)
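For concreteness, here is a minimal naive reference implementation of these four operations in Python. It only pins down the intended semantics (the class and all names are invented for illustration); every operation runs in O(n) time or worse, so it does not meet the efficiency requirement.

    # Naive reference implementation: fixes the semantics of the four
    # operations, but every one of them costs O(n) time or worse.
    class NaiveIntervalMap:
        def __init__(self, n, fill=0):
            self.n = n
            self.vals = [fill] * (n + 1)   # index 0 unused; the domain is [1, n]

        def get(self, i):
            return self.vals[i]

        def set(self, a, b, y):
            for i in range(a, b + 1):
                self.vals[i] = y

        def add(self, a, b, delta):
            for i in range(a, b + 1):
                self.vals[i] += delta

        def stab(self, i):
            # Largest interval [a, b] with a <= i <= b on which f is constant.
            a = b = i
            while a > 1 and self.vals[a - 1] == self.vals[i]:
                a -= 1
            while b < self.n and self.vals[b + 1] == self.vals[i]:
                b += 1
            return (a, b)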

D.W.
source
I guess splay trees are the starting point. add would be linear in the number of subintervals of [a,b] though; have you thought about a splay tree with additional unary “+δ” nodes, compacted lazily?
Gilles 'SO- stop being evil'
Consider f such that f(i) ≠ f(j) for all i ≠ j. Then you have to have n values stored somewhere. Performing set([a,b],y) has to get rid of those values somehow (by rewriting them or throwing them away -- you can postpone it with GC, but you will have to do O(n) operations at some point). As such, the operation will be O(n).
avakar
@avakar, I would be happy with a solution that treats GC as effectively "free". More generally, I would be happy with a solution where the running times are amortized running times (thus the cost of GC can be amortized in the cost of creating the value in the first place).
D.W.
You noted that constant and logarithmic time are efficient, and linear time is slow. Would O(n lg n) time be too slow for your needs?
jbapple
@jbapple, hey, it's a start! I think that's worth documenting as an answer.
D.W.

Answers:


I believe logarithmic time for all queries is achievable. The main idea is to use an interval tree, where each node in the tree corresponds to an interval of indices. I'll build up the key ideas by starting with a simpler version of the data structure (one that supports get and set but not the other operations), and then add features to support the remaining operations as well.

A simple scheme (supports get and set, but not add or stab)

Say that an interval [a,b] is flat if the function f is constant on [a,b], i.e., if f(a) = f(a+1) = ⋯ = f(b).

Our simple data structure will be an interval tree. In other words, we have a binary tree, where each node corresponds to an interval (of indices). We'll store the corresponding interval I(v) in each node v of the tree. Each leaf will correspond to a flat interval, and the leaves will be arranged so that reading them out left-to-right gives us a sequence of consecutive flat intervals which are disjoint and whose union is all of [1,n]. The interval for an internal node will be the union of the intervals of its two children. Also, in each leaf node v we will store the value V(v) of the function f on the interval I(v) corresponding to this node (note that this interval is flat, so f is constant on it, so we only need to store a single value of f in each leaf node).

Equivalently, you can imagine that we partition [1,n] into flat intervals, and then the data structure is a binary search tree where the keys are the left endpoints of those intervals. The leaves contain the value of f at some range of indices where f is constant.
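For illustration, a node of this tree might be laid out roughly as below. This is only a sketch; the field names are invented here, and balancing is not shown.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        lo: int                         # I(v) = [lo, hi], the interval of indices covered by v
        hi: int
        value: Optional[int] = None     # V(v): set only on leaves, where f is constant on [lo, hi]
        left: Optional["Node"] = None   # children; an internal node covers the union
        right: Optional["Node"] = None  #   of its children's intervals

        def is_leaf(self):
            return self.left is None and self.right is None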

Use standard methods to ensure that the binary tree remains balanced, i.e., its depth is O(lg m) (where m counts the current number of leaves in the tree). Of course, m ≤ n, so the depth is always at most O(lg n). This will be helpful below.

We can now support the get and set operations as follows:

  • get(i) is easy: we traverse the tree to find the leaf whose interval contains i. This is basically just traversing a binary search tree. Since the depth is O(lg n), the running time is O(lg n).

  • set([a,b],y) is trickier. It works like this:

    1. First, we find the leaf interval [a₀, b₀] containing a; if a₀ < a, then we split this leaf interval into the two intervals [a₀, a-1] and [a, b₀] (thus turning this leaf node into an internal node and introducing two children).

    2. Next, we find the leaf interval [a₁, b₁] containing b; if b < b₁, we split this leaf interval into the two intervals [a₁, b] and [b+1, b₁] (thus turning this leaf node into an internal node and introducing two children).

    3. At this point, I claim that the interval [a,b] can be expressed as the disjoint union of O(lg n) intervals corresponding to some subset of O(lg n) nodes in the tree. So, delete all the descendants of those nodes (turning them into leaves) and set the value stored in those nodes to y.

    4. Finally, since we modified the shape of the tree, we'll perform any necessary rotations to re-balance the tree (using any standard technique for keeping a tree balanced).

    Since this operation involves a few simple operations on O(lg n) nodes (and that set of nodes can be easily found in O(lg n) time), the total time for this operation is O(lg n).

This shows that we can support both the get and set operations in O(lg n) time per operation. In fact, the running time can be shown to be O(lg min(n,s)), where s is the number of set operations performed up until now.
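One quick way to prototype this simple scheme is the "search tree over left endpoints" view mentioned above. The sketch below (all names invented here) keeps the left endpoints of the flat intervals in a sorted Python list and uses bisect; list insertion and slicing are not O(lg n), so this only illustrates the logic, and a balanced search tree would be needed for the stated bounds. Adjacent intervals that happen to share a value are not merged yet; that only becomes necessary for stab, discussed later.

    import bisect

    # Prototype of the simple scheme: f is stored as a partition of [1, n] into
    # flat intervals.  starts[k] is the left endpoint of the k-th flat interval
    # and vals[k] is the constant value of f on it.  Plain lists keep the code
    # short; a balanced search tree is needed for the O(lg n) bounds.
    class SimpleIntervalMap:
        def __init__(self, n, fill=0):
            self.n = n
            self.starts = [1]
            self.vals = [fill]

        def _locate(self, i):
            # Index of the flat interval whose range contains position i.
            return bisect.bisect_right(self.starts, i) - 1

        def get(self, i):
            return self.vals[self._locate(i)]

        def _split(self, i):
            # Steps 1-2 of set: make sure some flat interval begins exactly at i.
            k = self._locate(i)
            if self.starts[k] != i:
                self.starts.insert(k + 1, i)
                self.vals.insert(k + 1, self.vals[k])

        def set(self, a, b, y):
            # Step 3 of set: after splitting at both ends, replace every flat
            # interval inside [a, b] by a single interval with value y.
            self._split(a)
            if b + 1 <= self.n:
                self._split(b + 1)
            lo, hi = self._locate(a), self._locate(b)
            self.starts[lo:hi + 1] = [a]
            self.vals[lo:hi + 1] = [y]

For example, after set(3, 5, 7) on a map over [1, 10] initialized to 0, get(4) returns 7 and get(6) returns 0.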

Adding support for add

We can modify the above data structure so that it can also support the add operation. In particular, instead of storing the value of the function in the leaves, it'll be represented as the sum of numbers stored in a set of nodes.

More precisely, the value f(i) of the function at input i will be recoverable as the sum of the values stored in the nodes on the path from the root of the tree down to the leaf whose interval contains i. In each node v we'll store a value V(v); if v₀, v₁, …, vₖ represent the ancestors of a leaf vₖ (including the leaf itself), then the value of the function on I(vₖ) will be V(v₀) + ⋯ + V(vₖ).

It is easy to support the get and set operations using a variant of the techniques described above. Basically, as we traverse the tree downward, we keep track of the running sum of values, so that for each node x that the traversal visits, we'll know the sum of the values of the nodes on the path from the root to x. Once we do that, simple adjustments to the implementation of get and set described above will suffice.

And now we can support add([a,b], δ) efficiently. First, we express the interval [a,b] as the union of O(lg n) intervals corresponding to some set of O(lg n) nodes in the tree (splitting a node at the left endpoint and right endpoint if needed), exactly as done in steps 1-3 of the set operation. Now, we simply add δ to the value stored in each of those O(lg n) nodes. (We don't delete their descendants.)

This provides a way to support get, set, and add in O(lg n) time per operation. In fact, the running time per operation is O(lg min(n,s)), where s counts the number of set operations plus the number of add operations.
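The structure described above keeps its leaves aligned with flat intervals. A closely related, easier-to-code relative of the same "sum of values along the root-to-leaf path" idea is a fixed-shape segment tree over [1, n] with lazy "assign" and "add" tags. The sketch below is that standard technique, with invented names, not the exact structure from this answer; it supports get, set, and add in O(lg n) each, but not stab.

    # Fixed-shape segment tree over [1, n] with lazy "assign" and "add" tags.
    # f(i) is recovered from the tags met on the root-to-leaf path: tags nearer
    # the root are always the most recent, so the first assignment found on the
    # way down, plus the additions at and above it, determines f(i).
    class LazySegTree:
        def __init__(self, n, fill=0):
            self.n = n
            size = 4 * n
            self.assign_tag = [None] * size
            self.assign_tag[1] = fill          # the whole array starts out equal to fill
            self.add_tag = [0] * size

        def _push(self, v):
            # Move v's pending tags down to its children, composing them so that
            # v's (more recent) tag acts after whatever the child already held.
            for c in (2 * v, 2 * v + 1):
                if self.assign_tag[v] is not None:
                    self.assign_tag[c] = self.assign_tag[v]
                    self.add_tag[c] = 0
                self.add_tag[c] += self.add_tag[v]
            self.assign_tag[v] = None
            self.add_tag[v] = 0

        def _update(self, v, lo, hi, a, b, y, delta):
            # Apply "assign y" (if y is not None) or "add delta" to [a, b].
            if b < lo or hi < a:
                return
            if a <= lo and hi <= b:
                if y is not None:
                    self.assign_tag[v] = y
                    self.add_tag[v] = 0
                else:
                    self.add_tag[v] += delta
                return
            self._push(v)
            mid = (lo + hi) // 2
            self._update(2 * v, lo, mid, a, b, y, delta)
            self._update(2 * v + 1, mid + 1, hi, a, b, y, delta)

        def set(self, a, b, y):
            self._update(1, 1, self.n, a, b, y, 0)

        def add(self, a, b, delta):
            self._update(1, 1, self.n, a, b, None, delta)

        def get(self, i):
            v, lo, hi, total_add = 1, 1, self.n, 0
            while True:
                total_add += self.add_tag[v]
                if self.assign_tag[v] is not None:
                    return self.assign_tag[v] + total_add
                mid = (lo + hi) // 2
                if i <= mid:
                    v, lo, hi = 2 * v, lo, mid
                else:
                    v, lo, hi = 2 * v + 1, mid + 1, hi

For example, with n = 10 and fill 0, set(1, 10, 5) followed by add(3, 4, 2) makes get(3) return 7 and get(5) return 5.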

Supporting the stab operation

The stabbing query is the most challenging to support. The basic idea will be to modify the above data structure to preserve the following additional invariant:

(*) The interval I(v) corresponding to each leaf v is a maximal flat interval.

Here I say that an interval [a,b] is a maximal flat interval if (i) [a,b] is flat, and (ii) no strictly larger interval containing [a,b] is flat (in other words, for all a′, b′ satisfying 1 ≤ a′ ≤ a ≤ b ≤ b′ ≤ n, either [a′,b′] = [a,b] or [a′,b′] is not flat).

This makes the stab operation easy to implement:

  • stab(i) finds the leaf whose interval contains i, and then returns that interval.

However, now we need to modify the set and add operations to maintain the invariant (*). Each time we split a leaf into two, we might violate the invariant if some adjacent pair of leaf intervals have the same value of the function f. Fortunately, each set/add operation adds at most 4 new leaf intervals. Also, for each new interval, it is easy to find the leaf intervals immediately to its left and right. Therefore, we can tell whether the invariant was violated; if it was, then we merge adjacent intervals where f has the same value. Fortunately, merging two adjacent intervals does not trigger cascading changes (so we don't need to check whether the merge might have introduced additional violations of the invariant). In all, this involves examining at most 12 = O(1) pairs of intervals and possibly merging them. Finally, since a merge changes the shape of the tree, if this violates the balance invariants, we perform any necessary rotations to keep the tree balanced (following standard techniques for keeping binary trees balanced). In total, this adds at most O(lg n) additional work to the set/add operations.

Thus, this final data structure supports all four operations, and the running time for each operation is O(lg n). A more precise estimate is O(lg min(n,s)) time per operation, where s counts the number of set and add operations.
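Continuing the earlier list-based prototype (again with invented names, and again with plain sorted lists standing in for a balanced tree), the sketch below maintains invariant (*) by merging neighbouring intervals with equal values after each set, so stab is just a lookup. The add operation is omitted here; in the tree described in this answer it is handled with the lazy sums sketched above.

    import bisect

    # Sketch of invariant (*): the stored flat intervals are kept maximal by
    # merging neighbours with equal values after every update, so stab() is a
    # simple lookup.  Sorted lists stand in for a balanced tree; add is omitted.
    class MergedIntervalMap:
        def __init__(self, n, fill=0):
            self.n = n
            self.starts = [1]     # left endpoints of the maximal flat intervals
            self.vals = [fill]    # value of f on each of them

        def _locate(self, i):
            return bisect.bisect_right(self.starts, i) - 1

        def get(self, i):
            return self.vals[self._locate(i)]

        def stab(self, i):
            # The interval stored at position k is already maximal flat.
            k = self._locate(i)
            right = self.starts[k + 1] - 1 if k + 1 < len(self.starts) else self.n
            return (self.starts[k], right)

        def _split(self, i):
            k = self._locate(i)
            if self.starts[k] != i:
                self.starts.insert(k + 1, i)
                self.vals.insert(k + 1, self.vals[k])

        def _merge_at(self, k):
            # Merge interval k into interval k-1 if they carry the same value.
            if 0 < k < len(self.starts) and self.vals[k] == self.vals[k - 1]:
                del self.starts[k]
                del self.vals[k]

        def set(self, a, b, y):
            self._split(a)
            if b + 1 <= self.n:
                self._split(b + 1)
            lo, hi = self._locate(a), self._locate(b)
            self.starts[lo:hi + 1] = [a]
            self.vals[lo:hi + 1] = [y]
            # Only the two boundaries of the rewritten range can now violate (*),
            # so a constant number of neighbour checks restores the invariant.
            self._merge_at(lo + 1)   # new interval vs. its right neighbour
            self._merge_at(lo)       # new interval vs. its left neighbour

For example, on a map over [1, 10] initialized to 0, after set(3, 5, 7) and set(1, 2, 7), the call stab(2) returns (1, 5) and stab(8) returns (6, 10).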

Parting thoughts

Phew, this was a pretty complex scheme. I hope I didn't make any mistakes. Please check my work carefully before relying upon this solution.

D.W.
source