1 Telcordia Technologies, Piscataway, NJ, USA. E-mail: [email protected]
2 School of Computer Science, University of Electronic Science and Technology of China, China. E-mail: [email protected]
3 Department of Computer Science, University of Calgary, Canada. E-mail: [email protected]

March 8, 2011 Abstract. Collision-intractable hashing is an important cryptographic primitive with numerous applications including efficient integrity checking for transmitted and stored data, and software. In several of these applications, it is important that in addition to detecting corruption of the data we also localize the corruptions. This motivates us to introduce and investigate the new notion of corruption-localizing hashing, defined as a natural extension of collision-intractable hashing. Our main contribution is in formally defining corruption-localizing hash schemes and designing two such schemes, one starting from any collision-intractable hash function, and the other starting from any collision-intractable keyed hash function. Both schemes have attractive efficiency properties in three important metrics: localization factor, tag length and localization running time, capturing the quality of localization, and performance in terms of storage and time complexity, respectively. The closest previous results, when modified to satisfy our formal definitions, only achieve similar properties in the case of a single corruption.

1 Introduction

A collision-intractable hash function is a fundamental cryptographic primitive, that maps arbitrarily long inputs to fixed-length outputs, with the required property that it is computationally infeasible to obtain two inputs that are mapped to the same output. One popular application of such functions is in the authentication and integrity protection of communicated data (i.e., as building blocks in the construction of digital signatures and message authentication codes). Other popular and more direct applications include practical scenarios that demand reliability of downloaded software files and/or protection of stored data against malicious viruses, as we now detail. Software Reliability. Downloading software is a frequent need for computer users and checking the reliability of such software has become a task of crucial importance. One routinely used technique consists of accompanying software files with a short tag, computed as the output returned by a collision-intractable hash function on input the file itself. Later, the same function is used to detect whether the file has changed (assuming that no modification was done to the tag), and thus detect whether the software file was corrupted. An important example of the success of this technique is Tripwire [12], a widely available and recommended integrity checking program for the UNIX environment. However, with this approach even if one byte error (beyond the error-correction/detection capability of transmission protocols such as TCP) occurs in the transmission, the user has to download the whole file again. This is a waste of bandwidth and time. Alternatively, it would be desired to use a new kind of tag for which one can determine which blocks are corrupted and only retransmit those. Virus Detection. Some of the most successful modern techniques attempting to solve the problem of virus detection fall into the general paradigm of integrity checking; see, e.g. 
[20, 21] (in addition to other well-known paradigms, such as virus signature detection, which we do not deal with here). As before, tags computed using cryptographic hash functions detect any undesired changes in a given file or, more generally, file system (see, e.g., [5]) due to viruses. A taxonomy of virus strategies for changing files is given in [20]. With respect to that terminology, in the rest of the paper we consider the so-called ‘rewriting infection strategies’, where any single virus is allowed to rewrite up to a given number of consecutive blocks in a file (or, similarly, of consecutive files in a file system). In the context of virus defense, it would be desirable to focus the so-called ‘virus diagnostics’ [20] phase on the localized area in the file rather than on the entire file (we stress that this phase is usually both very resource-expensive and failure-prone, especially as the paradigm of integrity checking is typically used when not much information is available about the attacking virus).

In both of the above scenarios, after the data is detected to be corrupted, some potentially expensive procedure is required to deal with the corruption. For instance, in the case of software file download, the download procedure needs to be repeated from scratch; and in the case of stored data integrity, the impact of the corruption needs to be carefully analyzed so as to potentially recover the data, sometimes triggering an expensive, human-driven, virus diagnostics procedure. Thus, in these scenarios, in addition to detecting that the data was corrupted, it would be of interest to obtain some information about the location of such corruptions (i.e., a relatively small area that includes all corrupted data blocks). For our two scenarios, such information would immediately imply savings in communication complexity (as only part of the download procedure is repeated), and reduce human resource costs (as the virus diagnostic phase will just focus on the infected data).
This motivates us to formally define and investigate a new notion for cryptographic hashing, called corruption-localizing hashing, that naturally extends cryptographic hashing to achieve such goals.

Our contribution. Extending a concept put forward in [8], we formally define and investigate corruption-localizing hashing schemes (consisting of a hashing algorithm and a localization algorithm), defined as a natural generalization of collision-intractable hashing functions. With our formal definition of corruption-localizing hashing we define three important metrics: localization factor, tag length and localization running time, to capture the effectiveness of the localization, and efficiency of the system in terms of storage and time complexity, respectively. Localization factor is the ratio of the size of the area that is output by the localization algorithm to the size of corrupted area, where the former is required to contain the latter. We observe that simple techniques imply corruption-localizing hashing schemes with linear localization factor, or with small localization factor but with either a large localization running time or a large tag length. We then target the construction of hashing schemes that achieve sub-linear localization without significantly increasing tag length or running time. Our main results are two schemes with provable corruption-localization whose properties are detailed in Figure 1, where HS is presented for constant v and the general case is stated in Theorem 1. Note that our schemes significantly improve the localization of v ≥ 1 corruptions, at the cost of only slightly increasing storage complexity and running time of a conventional collision-resistant hash function. For instance, when v is constant, our first scheme, based on any collision-intractable hash function, achieves sub-linear localization factor and logarithmic tag length. Moreover, our second scheme, based on any collision-intractable keyed hash function, has constant localization factor and poly-logarithmic tag length. Using our schemes, in the software downloading scenario above, one can first obtain the (maybe corrupted) file and its tag (authentic), then use the latter to localize the corrupted parts and finally request retransmission of the localized parts only. Here, the tag used by our schemes is short and thus its authenticity can be guaranteed with small redundancy by standard error-correcting techniques (or, in certain applications, using a low-capacity channel).

Previous work. The concept of localization is clearly not new, and can be considered as intermediate between the two concepts of detection and correction, which are well studied, for instance, in the coding theory

Scheme     Localization factor   Storage complexity     Original hash function   Remark              Constraint
Trivial1   O(n/v)                O(1)                   cr                       -                   -
Trivial2   1                     nσ                     cr                       -                   -
[8]        O(1)                  O(σ log n)             cr                       -                   |S| < n/4
HS         O(n^c)                O(σ log n)             cr                       for some c < 1      |S| < n/4
HS         O(n^d)                O(σ log^2 n)           cr                       for any 0 < d < 1   |S| < n/2(v + 1)
KHS        O(v^3)                O(σ v^2 λ log_v n)     cr-keyed                 -                   |S| < n/2(v + 1)

Fig. 1. Asymptotic performance of the 2 trivial schemes detailed at the end of Section 2, of a previous result from [8] for a single corruption, of 2 instantiations of our first scheme HS, and of our second scheme KHS for v corruptions. The term ‘cr’ (resp., ‘cr-keyed’) is an abbreviation for ‘collision-resistant’ (resp., ‘collision-resistant, keyed’). Also, n denotes the file length, λ a security parameter that can be set to O(log^{1+ϵ} n), for some ϵ > 0, σ the output length of the (atomic) collision-resistant (keyed) hash function, and |S| denotes the size of the largest corruption returned by the adversary. The value v for HS in the table is assumed to be constant; the general case can be found in Theorem 1.

and watermarking literatures. In general terms, localization is expected to provide better benefits and demand more resources than detection and provide worse benefits and demand less resources than correction, where, depending on applications and on benefit/resource tradeoffs, one concept may be preferable over the other two. Moreover, our paper differs crucially from research in both fields of coding theory and watermarking in that it specifically targets constructions based on cryptographic hash functions, and their applications. This difference translates in different construction techniques, security properties (as the collision-intractability and corruption-localization of cryptographic hash functions and the correction property in coding theory are substantially different properties), and adversary models (typically, in coding theory one considers arbitrary changes which can be modeled as unbounded adversaries, while we only consider polynomial-time bounded adversaries). By definition, the collision-intractability property of cryptographic hash functions already provides a computational version of the detection property but falls short of providing non-trivial localization, which we target here. We also note that several aspects in the mentioned example applications have also been studied from various angles. A first example is from [10] which studied the security of software download in mobile e-commerce. This paper and follow-up ones mainly focus on software-based security and risks involved in this procedure. A second example is from [4], which introduced a theoretical model for checking the correctness of memories. This paper and follow-up ones do not target constructions based on cryptographic hash functions, and the constructions exhibit similar differences and tradeoffs with our paper, as for the previously mentioned detection and correction concepts. 
A third example, apparently the closest line of research to the one from our paper, is from (non-adaptive) combinatorial group testing [9]. In this area, the goal is to devise combinatorial tests to efficiently find which objects out of a pool are defective. Note that testing whether a collision-resistant hash function maps two messages to the same tag could be considered a combinatorial test, and thus the technique from this area might be applicable to our problem. However, one main crucial difference here is that combinatorial group testing refers to same-size objects, while in this paper we recognize that practical corruptions may have very different sizes. Thus, even the best approaches from this area (exactly finding w defective objects out of a pool of n using O(w^2 log n) storage) do not scale well, as a single corruption, as defined in our model, may imply w = ω(√n) and thus super-linear storage, which is worse than the Trivial2 construction in Figure 1. Other important differences include the following: this area implements the above correction concept, while our paper focuses on localization; moreover, our paper works out the exact security analysis of the hashing functions, while the combinatorial group testing area only focuses on combinatorial aspects.

Overall, the closest previous result to ours appeared in [8], which informally introduced a notion equivalent to corruption-localization hashing, for the case of a single corruption. One of their schemes satisfies our formal definition in the case of a single corruption, and is a special case of our first scheme. We stress that the extension to multiple corruptions is quite non-trivial both with respect to the formal definition (see Section 2) and with respect to the constructions and proofs (see Sections 3, 4).

2 Definitions and Model

We assume familiarity with families of (conventional and keyed) cryptographic hash functions and pseudorandom function families. Here, we present our new notions and formal definitions of corruption-localizing hash schemes.

Corruption-Localizing Hashing: Notations. We assume that the input x to a (keyed) hash function consists of a number of atomic blocks (e.g., a bit or a byte or a line); let x[i] denote the i-th block of x; i is called the index of x[i]; let x[i, j] denote the sequence of consecutive blocks x[i], x[i + 1], ..., x[j − 1], x[j], also called a segment. In general, for S = {i_1, ..., i_t} ⊆ {0, ..., n − 1}, define x[S] = x[i_1]x[i_2] ··· x[i_t]. A sequence of segments (x[i_1, j_1], ..., x[i_k, j_k]) is also called a segment list. We define a left cyclic shift operator L for x by L(x) = x[1]x[2] ··· x[n − 1]x[0]. Iteratively applying L, we have L^i(x) = x[i] ··· x[n − 1]x[0] ··· x[i − 1] for any i ≥ 0. For a set S, |S| denotes the number of elements in it. For any (possibly probabilistic) algorithm A, an oracle algorithm is denoted as A^O, where O is an (oracle) function, and the notation a ← A(x, y, z, ...) denotes the random process that runs algorithm A on input x, y, z, ..., and denotes the resulting output as a.

Corruption-Localizing Hashing: Formal Model. Our generalization of collision-intractable hash functions into hash schemes and keyed hash schemes is in having, in addition to the hashing algorithm, a second algorithm, called the localizer, which, given a corrupted input x′ and the hash value (also called tag) for the original input x, returns some indices of input blocks. If x is a message (or file) and x′ is its corrupted version, then the localizer’s output is a set of indices covering all corrupted segments of the input file. This improves over conventional hashing, which typically reveals that a corruption happened, but does not offer any further information about which input blocks it happened at.
To measure the quality of the localization, we introduce a parameter, called localization factor, that determines the accuracy of the localizer and is defined (roughly speaking) as the ratio of the size of the localizer output to the size of the actual corrupted blocks. (Note that since the file size is measured in terms of the number of blocks, we only need to consider the number of blocks.) In this model, we only consider a replacement attack: given input x, the adversary replaces up to v segments of x by new ones while each replaced segment preserves its original length (i.e., contains the same number of blocks). Our model allows each segment to contain an arbitrary and unknown number of blocks. This adversary model captures well the applications described in the introduction. For instance, when a software file is downloaded over the Internet, some packets (regarding the payload in one packet as one block) get noisy or even lost. In rewriting infections by viruses, some lines in an executable might be replaced by malicious commands. Our objective for localization is to output a small set T of indices that contains the corrupted blocks. Then, in the case of software download, we only need to request retransmission of the blocks in T. We will be mainly interested in partially corrupted files, for which a localization solution for the applications mentioned in the introduction is of much more interest. Thus, when designing our schemes, we assume a (sufficiently large) upper bound β on the size of the maximum corruption segment.

Before describing the model, we define the difference between x and its corrupted version x′. We generally consider the case where x′ is corrupted from x by v segments (instead of blocks). Given as input two n-block strings x and x′, we define a function Diff_v as follows. For S ⊆ {0, ..., n − 1}, let \overline{S} = {0, ..., n − 1} \ S. Then

Diff_v[x, x′] = min Σ_{i=1}^{v} |S_i|,

where each S_i ⊆ {0, ..., n − 1} is a segment, and the minimum is over all possible {S_i}_{i=1}^{v} such that x[\overline{∪_{i=1}^{v} S_i}] = x′[\overline{∪_{i=1}^{v} S_i}]. Here each S_i might be empty, and x[\overline{∪_{i=1}^{v} S_i}] and x′[\overline{∪_{i=1}^{v} S_i}] are the strings x and x′, respectively, with the segments S_i, i = 1, ..., v, removed. Intuitively, Diff_v[x, x′] is the minimal total size of v segments that an adversary can modify in order to change x to x′. For example, let v = 2, n = 11, x = 00000000000, and x′ = 10100000100, and assume x′ is the corrupted version of string x. The minimal total size of two segments in x that one can modify in order to change x to x′ is 4: S_1 = {0, 1, 2}, S_2 = {8} and Diff_2[x, x′] = 4. Generally, we say that S_i ⊆ {0, ..., n − 1}, i = 1, ..., v, achieve Diff_v[x, x′] if Σ_{i=1}^{v} |S_i| = Diff_v[x, x′] and x[\overline{∪_{i=1}^{v} S_i}] = x′[\overline{∪_{i=1}^{v} S_i}]. Note that Diff_v[x, x′] can always be computed in time O(n^{v−1}) by searching for the rightmost element of each segment S_i and verifying whether x[\overline{∪_{i=1}^{v} S_i}] = x′[\overline{∪_{i=1}^{v} S_i}]. On the other hand, Diff_v[x, x′] is mainly required in the definition of the security experiment below but need not be calculated in our corruption-localization algorithms. So we do not require an efficient algorithm for computing Diff_v[x, x′].

We then define a hash scheme as a pair HS = (CLH, LOC), where CLH is an algorithm that, on input an n-block string x (and, implicitly, a security parameter), returns a string tag, and LOC is an algorithm that, on input an n-block string x′ and a string tag, returns a set of indices T ⊆ {0, ..., n − 1}.
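The definition of Diff_v can be checked on small inputs with a short sketch. The function name `diff_v` and the encoding of blocks as string characters are illustrative assumptions, not part of the scheme. The search below exhaustively tries every way of grouping the differing positions into at most v consecutive runs, each covered by one tight segment; it is meant for small examples only, not as an efficient algorithm.

```python
from itertools import combinations

def diff_v(x, x_prime, v):
    """Brute-force Diff_v[x, x']: minimal total size of at most v segments
    that together cover every position where x and x' differ."""
    if len(x) != len(x_prime):
        raise ValueError("x and x' must have the same number of blocks")
    if v < 1:
        raise ValueError("v must be at least 1")
    differing = [i for i in range(len(x)) if x[i] != x_prime[i]]
    if not differing:
        return 0
    # Using more segments never hurts, so use as many as are useful.
    k = min(v, len(differing))
    best = None
    # Choose k - 1 cut points that split the sorted differing positions
    # into k consecutive groups; each group is covered by one segment.
    for cuts in combinations(range(1, len(differing)), k - 1):
        bounds = (0, *cuts, len(differing))
        total = sum(differing[bounds[t + 1] - 1] - differing[bounds[t]] + 1
                    for t in range(k))
        best = total if best is None else min(best, total)
    return best

# The example from the text: v = 2, n = 11.
print(diff_v("00000000000", "10100000100", 2))  # prints 4
```

On the example above, the optimal grouping is {0, 2} and {8}, giving segments {0, 1, 2} and {8}.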
Similarly, we define a keyed hash scheme as a pair (CLKH, KLOC), where CLKH is an algorithm that, on input an n-block string x (and, implicitly, security parameter λ) and a λ-bit string k, returns a string tag, and KLOC is an algorithm that, on input an n-block string x′, a λ-bit string k, and a string tag, returns a set of indices T ⊆ {0, ..., n − 1}. We now formally define the corruption-localization properties of hash schemes and keyed hash schemes, using three additional parameters: v, the number of corrupted segments; β, the upper bound on the number of corrupted blocks in the largest corruption segment; and α, the upper bound on the ratio of the number of blocks in the set T output by the localizing algorithm to Diff_v[x, x′].

Definition 1. Let HS = (CLH, LOC) be a hash scheme and KHS = (CLKH, KLOC) be a keyed hash scheme. For any t, ϵ, α, β, v ≥ 0, the hash scheme HS is said to be (t, ϵ, α, β, v)-corruption-localizing if for any algorithm A running in time t and returning corruption segments of size ≤ β, the probability that experiment HExp_{HS,A,hash}(α, v) (defined below) returns 1 is at most ϵ. For any t, q, ϵ, α, β, v ≥ 0, the keyed hash scheme KHS is said to be (t, q, ϵ, α, β, v)-corruption-localizing if for any oracle algorithm A running in time t, making at most q oracle queries, and returning corruption segments of size ≤ β, the probability that experiment KExp_{KHS,A,keyh}(α, v) (defined below) returns 1 is at most ϵ.

HExp_{HS,A,hash}(α, v)
1. (x, x′) ← A(α, v)
2. tag ← CLH(x)
3. T ← LOC(v, x′, tag)
4. if x[\overline{T}] ≠ x′[\overline{T}] then return: 1
5. if |T| > α · Diff_v[x, x′] then return: 1 else return: 0.

KExp_{KHS,A,keyh}(α, v)
1. k ← {0, 1}^λ
2. (x, x′) ← A^{CLKH_k(·)}(α, v)
3. tag ← CLKH_k(x)
4. T ← KLOC(k, v, x′, tag)
5. if x[\overline{T}] ≠ x′[\overline{T}] then return: 1
6. if |T| > α · Diff_v[x, x′] then return: 1 else return: 0.

In both above experiments, the adversary is successful if it either prevents effective localization (i.e., one of the modified blocks is not included in T ), or forces the scheme to exceed the expected localization factor (i.e., |T | > α · Diffv [x, x′ ]). Corruption-Localizing Hashing: metrics of interest. We use the following three main metrics of interest to evaluate and compare corruption-localizing hash schemes and keyed hash schemes. First, the parameter α in the above definition is called localization factor. Note that a collision-resistant hash function implies a trivial corruption-localizing hash scheme with localization factor at least α = n/v. This is by simply defining the algorithm Loc to return all blocks {0, . . . , n − 1}, where n is the length of the input to the hash function CLH. (This is scheme Trivial1 in Figure 1.) Clearly, we target better schemes with localization factor o(n/v) or even constant. A second metric of interest is the output length of the hash function, also called tag length. Note that a corruption-localizing hash scheme with localization factor 1 and efficient localizer running time can be simply constructed as follows: the tag is obtained by calculating the hash of each block in the input message individually (if a block is not small such as a long line); the localizer returns the indices where the hashes differ. (This is scheme Trivial2 in Figure 1.) Clearly, such a scheme is not interesting since the tag length is linear in n. Instead, we target schemes where the tag length is logarithmic or poly-logarithmic in n. A third metric of interest is the localizer’s running time as a function of n, where n is the length of the input to the function CLH (or CLKH). Our schemes only slightly decrease the efficiency of the atomic collision-resistant hash function used.
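The two trivial schemes can be sketched in a few lines. This is a minimal illustration under stated assumptions: SHA-256 from Python's `hashlib` stands in for the collision-resistant hash function, blocks are byte strings of a fixed common size (so that plain concatenation is unambiguous), and the function names are hypothetical. Returning the empty set when the single tag matches is a choice of this sketch; the text only specifies that Trivial1 returns all blocks.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Stand-in for the collision-resistant hash function (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()

def trivial1_hash(blocks):
    """Trivial1: one tag for the whole file, O(1) storage."""
    return h(b"".join(blocks))

def trivial1_loc(blocks_prime, tag):
    """Trivial1 localizer: on a mismatch it can only return every index."""
    if h(b"".join(blocks_prime)) == tag:
        return set()
    return set(range(len(blocks_prime)))

def trivial2_hash(blocks):
    """Trivial2: one tag per block, so the tag length is linear in n."""
    return [h(b) for b in blocks]

def trivial2_loc(blocks_prime, tags):
    """Trivial2 localizer: return exactly the indices whose hashes differ."""
    return {i for i, (b, t) in enumerate(zip(blocks_prime, tags)) if h(b) != t}

# A 16-block file with blocks 7 and 8 rewritten.
x = [bytes([i]) for i in range(16)]
x_prime = list(x)
x_prime[7], x_prime[8] = b"\xaa", b"\xbb"
print(len(trivial1_loc(x_prime, trivial1_hash(x))))     # prints 16
print(sorted(trivial2_loc(x_prime, trivial2_hash(x))))  # prints [7, 8]
```

The contrast between the two outputs is the tradeoff the metrics capture: Trivial1 stores one hash but localizes nothing, while Trivial2 localizes exactly but stores n hashes.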

3 A Corruption-Localizing Hashing Scheme

In this section we design a corruption-localizing hash scheme based on any collision-resistant hash function. Our scheme can be instantiated so that it localizes up to v corruptions in an n-block file, while satisfying a non-trivial localization factor, very efficient storage complexity and only slightly super-linear runtime complexity. For instance, when v is constant (as a function of n), it has localization factor O(n^c), for some c < 1, and O(log n) storage complexity, or localization factor O(n^d), for any 0 < d < 1, and O(log^2 n) storage complexity. (See Theorem 1 and related remarks for formal and detailed statements.) In the rest of the section, we start with an informal description and a concrete example for the scheme, and then conclude with the formal description and a sketch of proof of its properties.

An informal description. At a very high level, our hash algorithm goes as follows. A collection of block segments from the n-block file x are joined to create several segment lists, and the collision-resistant hash function h_λ is applied to compute a hash tag for each segment list. The localizer, on input a file x′ with up to v corruptions, computes a hash tag on input the same segment lists from file x′, and eliminates all segment lists for which the obtained tag matches the tag returned by the hash algorithm. The remaining blocks are returned as the area localizing the v corruptions. The hard part in the above high-level description is choosing block segments and segment lists in such a way as to achieve desired values for the localization, storage and running time metrics. Here, our approach can be considered a non-trivial extension of the scheme from [8] that provides non-trivial localization for a single corruption (i.e., v = 1). We start by briefly recalling the mentioned scheme, and, in particular, by highlighting some of the properties that will be useful to describe our scheme.

A single-corruption scheme.
The scheme in [8] follows the above paradigm in the case v = 1 and localizes any single corrupted segment S (of up to n/4 blocks) with localization factor 2, using O(log n) storage and running in O(n log n) time. There, n = 2^w for some positive integer w. Now, assume that S satisfies 2^{w−i_0−1} < |S| ≤ 2^{w−i_0}, and let i = i_0 − 1. The n-block file x is split into 2^i consecutive segments, each containing 2^{w−i} blocks. Then, the 2^i segments are grouped into 2 segment lists such that the ℓ-th segment is assigned to segment list ℓ mod 2. Thus, each of the 2 segment lists contains 2^{i−1} segments. So far, the idea is that if, for some i, one of the 2 segment lists contains the entire corruption, then the localization is restricted to that segment list. However, it may happen that the corruption intersects both segment lists, in which case the above 2 tests do not help. To take care of this situation, the same process is repeated for a cyclic shift of file x by 2^{w−i−1} blocks. Then, the corruption will intersect at most 3 out of 4 segment lists, and the remaining one can be considered “corruption-free”. This already provides some localization, but further hash tags are needed to achieve an interesting localization factor. In particular, because the corruption size and thus the value i_0 are not known, the above process is repeated for i = 1, ..., w − 1 by the hash algorithm, and by the localizer until such an i_0 is found.

Our multiple-corruption scheme. The natural approach of using the same scheme for v ≥ 2 fails because an attacker can carefully place 2 corruptions so that one intersects both segment lists generated from file x and the other one intersects both segment lists generated from the cyclic shift of file x. This is simple to realize for any specific i, and can be realized so that the intersections happen for all i = 1, ..., w, by enforcing the intersections when i = 1. We avoid this problem by increasing the number of segment lists. Specifically, we write n = z^w for some positive integers z, w satisfying z > v (where parameter z has to be carefully chosen), and repeat the same process by using z segment lists rather than 2, for all i = 1, ..., w.
However, not every value of z would work: because each corrupted segment can intersect up to 3 segment lists (2 generated from file x and 1 from the cyclic shift of x, or vice versa), it turns out that, for instance, choosing z ≤ 3v/2 would still allow for one (less obvious) placement of the v corruptions by the attacker so that no segment list can be considered “corruption-free”. Moreover, choosing any z > 3v/2 may result in a less desirable localization factor. We deal with these problems by increasing the number of cyclic shifts, denoted as y, of the original file x: more precisely, we repeat the process for each file obtained by shifting x by n/y blocks. We can show that these two modifications suffice to maintain efficiency in storage and time complexity, to achieve effective localization (or else the collision-resistance of the original hash function is contradicted) and to achieve a non-trivial localization factor. To prove the latter claim, we show that: (1) over all cyclic shifts, v corruptions intersect at most v(y + 1) segment lists in total; (2) hence, there exists one cyclic shift of x for which these v corruptions intersect at most ⌊v(y + 1)/y⌋ segment lists; (3) for each i = 1, ..., w − 1, the set T_i of blocks that have not been declared “corruption-free” satisfies |T_i| ≤ n · ν^i for some ν < 1 and 0 ≤ i ≤ i_0, where i_0 is such that |S_a| ≤ z^{w−i_0}/y for all corrupted segments S_a and |S_a| > z^{w−i_0−1}/y for some corrupted segment S_a. Here, we note that fact (3) is proved using facts (1) and (2) and implies that the final output T_{w−1} from the localizer is a “good enough” localization of the v corruptions.

A concrete example. We discuss (and depict in Figure 2) a concrete example of our scheme, starting with a file x = x[0] ··· x[63], containing n = z^w = 4^3 = 64 blocks, with the parameter settings z = 4, w = 3.
Our scheme consists of tag algorithm CLH1 (see left side of Figure 2) and localization algorithm LOC1 (see right side of Figure 2).

Hash Algorithm. The algorithm CLH1 consists of w − 1 = 2 stages and can be considered as a sequence of computations of hash tags based on the following equations, for different values of ℓ, i:

tag_{ℓ,i,0} = h_λ(⋆), tag_{ℓ,i,1} = h_λ(⋄), tag_{ℓ,i,2} = h_λ(•), tag_{ℓ,i,3} = h_λ(◦),     (1)

where ⋆, ⋄, •, ◦ are 4 classes of segments that are obtained differently from x at each application of these equations.

Fig. 2. The HS scheme for n = z^w = 64, z = 4, w = 3, y = 2.

Stage one. x is split into z^1 = 4 segments of equal size n/z^1 = z^{w−1} = 16 (row 1 in the figure). That is, (0, ..., 63) = ⋆ ∥ ⋄ ∥ • ∥ ◦, and the equations in (1) are applied for ℓ = 0, i = 1. Now, set the parameter y = 2. Next, left cyclic shift x by a 1/y fraction of the segment size (see row 2), that is, by z^{w−1}/y = 8 blocks. The result is L^{z^{w−1}/y}(x) = L^8(x) = (8, 9, ..., 63, 0, ..., 7). Again split L^8(x) into z^1 = 4 segments ⋆ ∥ ⋄ ∥ • ∥ ◦ and apply the equations in (1) for ℓ = 1, i = 1. In this example, y = 2. If y ≥ 3, we would similarly need to consider L^{ℓ·z^{w−i}/y}(x) for all ℓ ≤ y − 1, with the cases ℓ = 0, 1 handled as above.

Stage two. Here, x is split into z^2 = 16 segments, each of size n/z^2 = 64/16 = 4 (see row 3). Then all segments are assigned to the 4 classes ⋆, ⋄, • and ◦: ⋆ contains segments 0, 4, 8, 12; ⋄ contains segments 1, 5, 9, 13; • contains segments 2, 6, 10, 14; ◦ contains segments 3, 7, 11, 15. (For instance, ⋆ consists of blocks 0, ..., 3, 16, ..., 19, 32, ..., 35 and 48, ..., 51.) Then we apply the equations in (1) for ℓ = 0, i = 2. Next, as in Stage one, we cyclically shift x by z^{w−2}/y = 4/2 = 2 blocks (see row 4). That is, we compute L^{z^{w−2}/y}(x) = L^2(x) = (2, 3, ..., 63, 0, 1). We similarly classify the segments of L^2(x) into classes ⋆, ⋄, • and ◦ and apply the equations in (1) for ℓ = 1, i = 2.

Localization Algorithm. Suppose x is corrupted into a file x′ by changing blocks 7, 8. We compute a set T ⊆ {0, ..., 63} that contains 7, 8 while |T| is not large. There are two stages. Initially, set T_0 = {0, ..., 63}.

Stage one. As for x, split x′ into ⋆ ∥ ⋄ ∥ • ∥ ◦ and compute tag′_{0,1,j}, j = 0, ..., 3. Since tag′_{0,1,j} = tag_{0,1,j} for j = 1, 2, 3, it follows that ⋄, •, ◦ are all uncorrupted (see row 1); otherwise, h_λ is not collision-resistant. Then we can update T_0 = T_0 \ {16, ..., 63} = {0, ..., 15}. By verifying tag′_{0,1,0} ≠ tag_{0,1,0}, we know that ⋆ contains a corruption. Then we consider the shift L^8(x′) of x′, i.e., (8, ..., 63, 0, ..., 7) (see row 2). Let T_1 = T_0. Compute tag′_{1,1,j}, j = 0, ..., 3. Since tag′_{1,1,j} = tag_{1,1,j} for j = 1, 2, the set T_1 = T_1 \ {24, ..., 55} = {0, ..., 15} remains unchanged.

Stage two. Consider row 3 in Figure 2. Split x′ into z^2 = 16 segments. Set T_2 = T_1. Compute tag′_{0,2,j}, j = 0, ..., 3. Since tag′_{0,2,0} = tag_{0,2,0}, we can update T_2 = T_2 − {0, ..., 3} − {16, ..., 19} − {32, ..., 35} − {48, ..., 51} = {4, ..., 15}. Similarly, from tag′_{0,2,3} = tag_{0,2,3}, we can update T_2 to T_2 = {4, ..., 11}. Next, consider the shift L^2(x′) of x′ (see row 4 in Figure 2). Compute tag′_{1,2,j}, j = 0, ..., 3. Since tag′_{1,2,j} = tag_{1,2,j} for j = 0, 2, 3, we can update T_2 by removing the indices not in ⋄. The result is T_2 = {4, ..., 11} − {2, ..., 5} − {10, ..., 17} = {6, 7, 8, 9}. So the localization factor here is α = 2, since 4 blocks are returned for 2 corrupted ones.

Formal description and proofs. Our formal presentation (in Fig. 3) is a generalization of the above concrete example, where the classes ⋆ ∥ ⋄ ∥ • ∥ ◦ are replaced by the symbols S_{ℓ,i,j}. The scheme’s properties are formally described in the following theorem.

The algorithm CLH1: On input x, |x| = n, and parameters (z, y), do the following:
- Randomly choose h_λ from H_λ
- For i = 1, ..., w − 1, and ℓ = 0, ..., y − 1,
    set s = ℓ · z^{w−i}/y and compute x_{ℓ,i} = L^s(x)
    split x_{ℓ,i} into segments B_{ℓ,i,0} ∥ ··· ∥ B_{ℓ,i,z^i−1} of equal length
    for j = 0, ..., z − 1,
      compute segment list S_{ℓ,i,j} = (B_{ℓ,i,j} ∥ B_{ℓ,i,j+z} ∥ ··· ∥ B_{ℓ,i,j+z^i−z})
      compute tag_{ℓ,i,j} = h_λ(S_{ℓ,i,j})
- Output: tag = {tag_{ℓ,i,j} | ℓ ∈ {0, ..., y − 1}, i ∈ {1, ..., w − 1}, j ∈ {0, ..., z − 1}} ∪ {n, z, y, desc(h_λ)}.

The algorithm LOC1: On input x′, |x′| = n, tag, and parameters (z, y), do the following:
- Let tag = {tag_{ℓ,i,j} | ℓ ∈ {0, ..., y − 1}, i ∈ {1, ..., w − 1}, j ∈ {0, ..., z − 1}} ∪ {n, z, y, desc(h_λ)}.
- Set T_0 = {0, ..., n − 1}.
- For i = 1, ..., w − 1,
    set T_i = T_{i−1}
    for ℓ = 0, ..., y − 1, and j = 0, ..., z − 1,
      compute B′_{ℓ,i,j}, S′_{ℓ,i,j} from x′ as done for B_{ℓ,i,j}, S_{ℓ,i,j} from x in CLH1
      let I_{ℓ,i,j} be the set of indices for B′_{ℓ,i,j}, i.e.,
        I_{ℓ,i,j} = {ℓ · z^{w−i} y^{−1} + j · z^{w−i}, ..., ℓ · z^{w−i} y^{−1} + j · z^{w−i} + z^{w−i} − 1}
      if h_λ(S′_{ℓ,i,j}) = tag_{ℓ,i,j} then update T_i = T_i \ ∪_{t=0}^{z^{i−1}−1} I_{ℓ,i,j+zt}.
- Output: T_{w−1}.

Fig. 3. The Corruption-Localizing Hash Scheme HS
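The two algorithms of Fig. 3 can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: SHA-256 stands in for the randomly chosen h_λ, blocks are fixed-size byte strings, block indices are taken modulo n to realize the cyclic shift, the parameters (z, w, y) are passed explicitly instead of being stored in the tag, and the names `clh1`, `loc1` and `_segment_list_indices` are hypothetical.

```python
import hashlib

def _h(blocks):
    """Stand-in for h_lambda (assumption: SHA-256 over concatenated blocks)."""
    return hashlib.sha256(b"".join(blocks)).digest()

def _segment_list_indices(n, z, w, y, l, i, j):
    """Original block indices covered by segment list S_{l,i,j}."""
    seg = z ** (w - i)        # segment length at stage i
    shift = l * seg // y      # cyclic shift s = l * z^(w-i) / y
    indices = []
    for t in range(z ** (i - 1)):          # segments j, j+z, ..., j+z^i-z
        start = shift + (j + z * t) * seg
        indices.extend((start + r) % n for r in range(seg))
    return indices

def clh1(x, z, w, y):
    """Hash algorithm: one tag per (shift l, stage i, class j)."""
    n = len(x)
    assert n == z ** w and z % y == 0
    return {(l, i, j): _h([x[k] for k in _segment_list_indices(n, z, w, y, l, i, j)])
            for i in range(1, w) for l in range(y) for j in range(z)}

def loc1(x_prime, tag, z, w, y):
    """Localizer: drop every segment list whose recomputed tag matches."""
    n = len(x_prime)
    remaining = set(range(n))              # T_0
    for i in range(1, w):
        for l in range(y):
            for j in range(z):
                idx = _segment_list_indices(n, z, w, y, l, i, j)
                if _h([x_prime[k] for k in idx]) == tag[(l, i, j)]:
                    remaining -= set(idx)  # declared corruption-free
    return remaining                       # T_{w-1}

# The concrete example: n = 64, z = 4, w = 3, y = 2, blocks 7 and 8 corrupted.
x = [bytes([k]) for k in range(64)]
x_prime = list(x)
x_prime[7], x_prime[8] = b"\xff", b"\xfe"
print(sorted(loc1(x_prime, clh1(x, 4, 3, 2), 4, 3, 2)))  # prints [6, 7, 8, 9]
```

The run reproduces the set {6, 7, 8, 9} obtained by hand in the concrete example, using 16 tags in total.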

Theorem 1. Let $z, y, v, \lambda, w$ be positive integers such that $y \mid z$, $v < yz(y+1)^{-1}$, $n = z^w$, and let $\beta = n/2y$. Assume $H = \{H_\lambda\}_{\lambda \in \mathbb{N}}$ is a $(t, \epsilon)$-collision-resistant family of hash functions from $\{0,1\}^{p_1(\lambda)} \to \{0,1\}^{\sigma}$. Then there exists a $(t', \epsilon', \beta, v)$-corruption-localizing hash scheme HS, where $\epsilon' = \epsilon$ and $t' = t + O(t_n(H) \cdot yz \log_z n)$, where $t_n(H)$ is the running time of functions from $H_\lambda$ on inputs of length $n$. Moreover, HS has localization factor $\alpha = \lfloor v(y+1)/y \rfloor^{-1} zy \cdot n^{\log_z \lfloor v(y+1)/y \rfloor}$, tag length $\tau = 3 \log n + \sigma z y \log_z n + |desc(H)|$, and runtime complexity $\rho = O(t_n(H) \cdot zy \log_z n)$, where $|desc(H)|$ is an upper bound on the description size of functions from $H_\lambda$.

Remarks and parameter instantiations. The condition $n = z^w$ is for simplicity only and can be removed by standard padding. When $v$ is constant, we can always choose constants $z, y$ such that $v < yz(y+1)^{-1}$. It follows that in this setting it always holds that $\alpha = O(n^c)$, for some constant $c < 1$. So our scheme does provide non-trivial localization (in terms of file size $n$): $\alpha$ sublinear, $\tau$ logarithmic and $\rho$ almost linear.
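To get a feel for the bounds in Theorem 1, the following sketch evaluates $\alpha$ and $\tau$ for concrete parameters. It assumes that the unsubscripted logarithm in $\tau$ is base 2 and omits the $|desc(H)|$ term; the function name `theorem1_parameters` is hypothetical.

```python
import math

def theorem1_parameters(n, z, y, v, sigma):
    """Evaluate alpha and tau from Theorem 1 (tau without the |desc(H)| term)."""
    if z % y != 0 or not v < y * z / (y + 1):
        raise ValueError("Theorem 1 requires y | z and v < yz/(y+1)")
    m = (v * (y + 1)) // y                        # floor(v(y+1)/y)
    log_z_n = math.log(n, z)
    alpha = (z * y / m) * n ** math.log(m, z)     # localization factor
    tau = 3 * math.log2(n) + sigma * z * y * log_z_n   # tag length in bits
    return alpha, tau

alpha, tau = theorem1_parameters(n=1024, z=2, y=2, v=1, sigma=256)
print(alpha, tau)
```

For $y = z = 2$ and $v = 1$ the floor term equals 1, so $\alpha$ evaluates to 4 independently of $n$, and $\tau$ reduces to $(3 + 4\sigma) \log n$.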

Moreover, by setting $y = v + 1$ and $z = \log n$, we have $\alpha = v^{-1}(v+1) \log n \times n^{\log^{-1}\log n \times \log v}$. By simple calculation, we have that for any $0 < c < 1$, $\alpha = O(n^c)$, $\tau = O(\log^2 n)$ and $\rho = O(n \log^2 n)$. That is, for any $0 < c < 1$, HS localizes any $v$ corruptions up to a sub-linear factor $O(n^c)$ with only poly-logarithmic tag length and slightly super-linear running time, where $v$ can be up to $c' \log n$, for $c' < c$. Finally, by setting $y = z = 2$ and $v = 1$, we obtain $\alpha = 4$, $\tau = (3 + 4\sigma) \log n$ and $\rho = 4n\sigma \log n$; i.e., HS localizes a single corruption up to a small constant factor with logarithmic tag length and slightly super-linear running time. Note that one scheme in [8] considered this special case and has a result essentially matching ours.

Proof idea of Theorem 1. As $\rho$ and $\tau$ can be checked by calculation, and effective localization can be seen to directly follow from the collision-intractability of the original hash function, here we only focus on justifying the localization factor $\alpha$. Obviously, $T_i$ is related to the size of each corrupted segment $S_a$. Let $i_0$ be such that each $|S_a| \le n z^{-i_0}/y$ but some $|S_a| \ge n z^{-i_0-1}/y$. If we are able to show that $|T_i| \le n \cdot \nu^i$ for some $\nu < 1$ and all $0 \le i \le i_0$, then we have that $|T_{w-1}| \le |T_{i_0}| \le n \cdot \nu^{i_0}$ and thus

$$|T_{w-1}| \le n z^{-i_0-1}/y \cdot zy (z\nu)^{i_0} \le zy \sum_a |S_a| \cdot z^{i_0 \log_z (z\nu)} \le zy \sum_a |S_a| \cdot n^{\log_z (z\nu)},$$

which is a sub-linear factor in $n$ since $z\nu < z$. So we need to show that the upper bound on $|T_i|$ decreases with $i$ by some factor $\nu < 1$ for $i \le i_0$. We demonstrate the technical idea for this using the example in Fig. 2. Here, the corrupted segment is $S_1 = \{7, 8\}$. Then, it holds that $i_0 = 2$. Consider rows 1 and 2 in Fig. 2. Since $S_1$ has size 2 and the segment size is $z^{w-1} = 16$, the event that $S_1$ intersects two neighboring segments can occur in at most one of $x$ and $L^8(x)$. In our example, in $L^8(x)$, $S_1$ intersects two segments $\{⋆, ◦\}$. So in $x$ and $L^8(x)$, there are at most 3 segments in total intersecting $S_1$ (in general, these are at most $v(y+1)$). So one of $x$ and $L^8(x)$ contains at most $\lfloor 3/2 \rfloor = 1$ corrupted segment (in general, these are $\lfloor v(y+1)/y \rfloor$). In our example, $x$ contains 1 corrupted segment. So $|T_1| = n/z = 16$ (in general, $|T_1| = \lfloor v(y+1)/y \rfloor \cdot n/z$). Now we only consider Stage two (rows 3 and 4 in Fig. 2). Again, since $S_1$ has size 2 and the segment size is $z^{w-2} = 4$, the event that $S_1$ intersects two neighboring segments can occur in at most one of $x$ and $L^2(x)$. The remaining part of this stage follows the idea in stage one. We obtain that $|T_2| = 4$ (in general, $|T_2| = \lfloor v(y+1)/y \rfloor \cdot |T_1|/z = (\lfloor v(y+1)/y \rfloor / z)^2 \cdot n$, where $\lfloor v(y+1)/y \rfloor / z < 1$ by assumption). The formal proof carefully generalizes the idea in this description (see Appendix B for details).
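The recursion behind this proof idea can be written out as a short worked calculation. The abbreviation $m$ is introduced here only for readability and is not notation from the paper; the numbers are those of the Fig. 2 example, with $v = 1$ and $y = 2$.

```latex
\[
  m := \lfloor v(y+1)/y \rfloor, \qquad
  |T_i| \le m \cdot \frac{|T_{i-1}|}{z}
  \;\Longrightarrow\;
  |T_i| \le \Bigl(\frac{m}{z}\Bigr)^{i} n \quad (0 \le i \le i_0),
  \qquad \text{so } \nu = m/z < 1 .
\]
\[
  n = 64,\ z = 4,\ y = 2,\ v = 1,\ m = \lfloor 3/2 \rfloor = 1:
  \qquad |T_1| \le 64/4 = 16, \qquad |T_2| \le 64/16 = 4 .
\]
```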

4 A Corruption-Localizing Keyed Hashing Scheme

In this section we propose a corruption-localizing keyed hash scheme starting from any collision-resistant keyed hash function. Our scheme improves on the previous (unkeyed) scheme in the localization factor for an arbitrary number of corruptions, and in the range of the number of corruptions for which it provides non-trivial localization. In particular, for a constant number of corruptions, it provides essentially optimal (up to a constant factor) localization, at the expense of small storage complexity and only a small increase in running time. (See Theorem 2 and related remarks for the formal statement.) In the rest of the section, we start with an informal description, then give a concrete example, the formal description and a sketch of the proof of its properties.

AN INFORMAL DESCRIPTION. By using keyed hash functions in our previous scheme, we do obtain a corruption-localizing keyed hash scheme. The following construction, however, makes a more intelligent use of the randomness in the key, resulting in significant improvements both in the localization factor and in the range for the number of corruptions, with only a slightly worse performance in storage and time complexity.

At a very high level, our keyed hash algorithm proceeds like the hash algorithm of scheme HS, with the following differences. The new algorithm uses the secret key shared with the localizer (and unknown to the attacker) as an input to a pseudo-random function that generates pseudo-random values. These values are used as colours associated with each block segment of each cyclic shift of file x (including the file x itself). Then, segment lists are created so that each segment list contains all block segments of a given colour. In other words, the generation of segment lists from the block segments is done (pseudo-)randomly and in a way that can be reproduced by both the hash algorithm and the localizer, but not by the attacker (as the key is unknown to the attacker and the hash tags are further encrypted using a different portion of the key). The reason for this pseudo-random generation of segment lists is that the deterministic generation done in scheme HS allows the attacker to place the corruptions so as to maximize the number of intersections with segment lists. This results in a localization factor still polynomial in n (even though the polynomial can be made as small as desired at moderate losses in terms of storage and time complexity). Instead, the pseudo-random generation of the segment lists makes it much harder for the attacker to place corruptions so as to intersect a large number of segment lists, and is crucial to achieve a constant localization factor (except with negligible probability).
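The informal description above can be sketched in Python, anticipating the shift and experiment structure of Fig. 5. Everything concrete here is an assumption made for illustration: HMAC-SHA256 stands in for both the pseudo-random function and the keyed hash, colours are derived from per-segment PRF labels instead of from a pseudorandom stream, and the names `clkh`, `kloc`, `_prf`, `_colour_lists` and `_masked_tag` are hypothetical.

```python
import hashlib
import hmac

def _prf(key, label):
    """Stand-in for both f_k and the keyed hash h(k; .): HMAC-SHA256."""
    return hmac.new(key, label, hashlib.sha256).digest()

def _colour_lists(blocks, k1, mu1, v, w, i, l, e):
    """Return {colour: (data, indices)} for shift l, stage i, colouring experiment e."""
    n, base = len(blocks), v + 1
    seg_len = base ** (w - i)                 # blocks per segment at stage i
    shift = l * seg_len // base               # left cyclic shift by l/(v+1) of a segment
    lists = {c: (bytearray(), set()) for c in range(base)}
    for k in range(base ** i):
        label = mu1 + repr((l, i, k, e)).encode()
        colour = _prf(k1, label)[0] % base    # modulo bias is ignored in this sketch
        data, indices = lists[colour]
        for g in range(seg_len):
            pos = (shift + k * seg_len + g) % n
            data += blocks[pos]               # extends the bytearray in place
            indices.add(pos)
    return lists

def _masked_tag(k2, k3, mu2, ident, data):
    mask = _prf(k2, mu2 + repr(ident).encode())   # fresh mask per (l, i, c, e)
    digest = _prf(k3, bytes(data))
    return bytes(a ^ b for a, b in zip(digest, mask))

def clkh(blocks, keys, nonces, v, w, lam):
    """Keyed tag algorithm: one masked digest per shift, stage, colour and experiment."""
    (k1, k2, k3), (mu1, mu2) = keys, nonces
    tags = {}
    for i in range(1, w):
        for l in range(v + 1):
            for e in range(lam):
                for c, (data, _) in _colour_lists(blocks, k1, mu1, v, w, i, l, e).items():
                    tags[(l, i, c, e)] = _masked_tag(k2, k3, mu2, (l, i, c, e), data)
    return tags

def kloc(blocks, tags, keys, nonces, v, w, lam):
    """Keyed localizer: remove every colour class whose masked digest still matches."""
    (k1, k2, k3), (mu1, mu2) = keys, nonces
    candidates = set(range(len(blocks)))
    for i in range(1, w):
        for l in range(v + 1):
            for e in range(lam):
                for c, (data, idx) in _colour_lists(blocks, k1, mu1, v, w, i, l, e).items():
                    if _masked_tag(k2, k3, mu2, (l, i, c, e), data) == tags[(l, i, c, e)]:
                        candidates -= idx
    return candidates
```

Barring HMAC collisions, the set returned by `kloc` always contains the corrupted block indices; how much larger it is depends on the colourings drawn, which is what the number of colouring experiments `lam` is meant to control.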

Fig. 4. The CLKH scheme for $n = z^w = 64$, $z = 4$, $w = 3$. (Note: $L^{12}(x)$ and $L^3(x)$ are not shown in the figure.)

A CONCRETE EXAMPLE. In Fig. 4 we illustrate an example for scheme KHS analogous to the one in the previous section for scheme HS. We again use file $x = x[0] \cdots x[63]$, but we now consider $v = 3$ and $n = (v+1)^w = 4^3 = 64$ and $w = 3$. As before, segments are somehow assigned to classes ⋆, •, ⋄, ◦, and analogues of the equations in (1) are used to compute hash tags, the differences being here that the hash
functions used are keyed functions, the assignment of the segments to the classes is probabilistic, and the tags are further encrypted using a key available to the localizer. Specifically, scheme HS can be regarded as assigning the classes to the segments periodically while the current scheme assigns a class to each segment randomly (see left part of Fig. 4). Now, let $x'$ be the corrupted version of $x$, where blocks 7, 8, 40 are changed. The localization algorithm (see right part of Fig. 4) returns $T_2 = \{6, 7, 8, 9, 40\}$, thus resulting in a localization factor $\alpha = 5/3 \approx 1.67$. The detailed computation is as follows.

Tag Algorithm. The CLKH algorithm for $x$ can be understood by looking at the left side of Fig. 4. It uses a fixed key $k = k_1|k_2|k_3$ but two new nonces $\mu_1, \mu_2$ for each $x$. It consists of $w - 1 = 3 - 1 = 2$ stages.

Stage one. $x$ is split into $(v+1)^1 = 4$ segments of equal size $n/(v+1)^1 = 16$ (row 1 in the figure). In the HS scheme, we compute a hash tag for each segment. Here, the algorithm consists of $\lambda$ coloring experiments. In experiment $z = 1, \dots, \lambda$, assign a random color in $\{C_0, C_1, \dots, C_3\}$ to each segment, where the fresh randomness is taken from a pseudorandom stream $psr_1 = f_{k_1}(\mu_1)$. In our example, we use ⋆, ⋄, •, ◦ to denote the different colors, to be consistent with scheme HS. That is, HS in the last section can be regarded as assigning the colors to the segments periodically while the current scheme assigns a color to a segment randomly. Use $SB_i$ to denote segment $i$, $i = 1, 2, 3, 4$. In the $z$th experiment (row 1 in Fig. 4), ⋆ = $\{SB_1, SB_3\}$, ⋄ = $\{SB_2\}$, • = empty, ◦ = $\{SB_4\}$. Compute $tag_{0,1,c,z}$ as

$tag_{0,1,C_0,z} = psr_2 \oplus h_\lambda(k_3; ⋆)$,
$tag_{0,1,C_1,z} = psr_2 \oplus h_\lambda(k_3; ⋄)$,
$tag_{0,1,C_2,z} = psr_2 \oplus h_\lambda(k_3; •)$,
$tag_{0,1,C_3,z} = psr_2 \oplus h_\lambda(k_3; ◦)$.

Here $psr_2 = f_{k_2}(\mu_2)$ is used to mask $h_\lambda(k_3; \cdot)$ (see footnote 4) so that the attacker cannot learn the randomness used in the assignment of $SB_i$. For each $tag_{\ell,j,c,z}$, a fresh portion of $psr_2$ is used as a mask. Next, left cyclic shift $x$ by $1/(v+1)$ of the segment size (see row 2). That is, shift by $(v+1)^{w-2} = 4$ blocks. Note that in the HS scheme, we shift by $(v+1)^{w-1}/y = 16/2 = 8$ blocks. In the CLKH scheme here, we generally use $y = v + 1$; this choice is made only for simplicity. The shift result is $L^{(v+1)^{w-1}/(v+1)}(x) = L^4(x) = (4, 5, \dots, 63, 0, 1, 2, 3)$. Again split $L^4(x)$ into $(v+1)^1 = 4$ segments. Run coloring experiments $z = 1, \dots, \lambda$. In experiment $z$, randomly assign a color $c \in \{C_0, C_1, C_2, C_3\}$ to each segment and compute tags $tag_{1,1,c,z}$ as

$tag_{1,1,C_0,z} = psr_2 \oplus h_\lambda(k_3; ⋆)$,
$tag_{1,1,C_1,z} = psr_2 \oplus h_\lambda(k_3; ⋄)$,
$tag_{1,1,C_2,z} = psr_2 \oplus h_\lambda(k_3; •)$,
$tag_{1,1,C_3,z} = psr_2 \oplus h_\lambda(k_3; ◦)$.

We can similarly compute tags $tag_{\ell,1,c,z}$ for $\ell = 2, 3$. Note that in Fig. 4, the row for $L^{12}(x)$ is not presented.

Stage two. In this stage, $x$ is split into $(v+1)^2 = 16$ segments, each of size $n/(v+1)^2 = 64/16 = 4$ (see row 4). In coloring experiment $z = 1, \dots, \lambda$, assign each segment randomly one of the colors $\{C_0, C_1, C_2, C_3\}$ (note that, as $\{⋆, ⋄, •, ◦\}$ is in one-to-one correspondence with the colors $\{C_0, \dots, C_3\}$ in order, we sometimes simply use an element of $\{⋆, ⋄, •, ◦\}$ to denote interchangeably a color or the set of segments under this color). In row 4, ⋆ contains segments $SB_0, SB_3, SB_8, SB_{12}$; ⋄ contains $SB_6, SB_9, SB_{13}$; • contains $SB_1, SB_5, SB_7, SB_{11}, SB_{15}$; ◦ contains segments $SB_2, SB_4, SB_{10}, SB_{14}$. Then compute

$tag_{0,2,C_0,z} = psr_2 \oplus h_\lambda(k_3; ⋆)$,
$tag_{0,2,C_1,z} = psr_2 \oplus h_\lambda(k_3; ⋄)$,
$tag_{0,2,C_2,z} = psr_2 \oplus h_\lambda(k_3; •)$,
$tag_{0,2,C_3,z} = psr_2 \oplus h_\lambda(k_3; ◦)$.

Footnote 4. For $h_\lambda$, we use a secret key $k_3$ only because a keyed collision-resistant hash can be realized more efficiently than an unkeyed one. Indeed, the former can be constructed assuming the existence of one-way functions, and any secure message authentication code is a realization of it, while for the latter the currently known minimal assumption is the existence of claw-free permutations.


Next, as in Stage one, we cyclically shift $x$ by $(v+1)^{w-2}/(v+1) = 4/4 = 1$ block (see row 5). That is, compute $L^1(x) = (1, 2, \dots, 63, 0)$. In experiment $z = 1, \dots, \lambda$, randomly assign each segment in $L^1(x)$ one of the colors $\{⋆, ⋄, •, ◦\}$ and compute

$tag_{1,2,C_0,z} = psr_2 \oplus h_\lambda(k_3; ⋆)$,
$tag_{1,2,C_1,z} = psr_2 \oplus h_\lambda(k_3; ⋄)$,
$tag_{1,2,C_2,z} = psr_2 \oplus h_\lambda(k_3; •)$,
$tag_{1,2,C_3,z} = psr_2 \oplus h_\lambda(k_3; ◦)$.

$L^2(x)$ and $L^3(x)$ are processed similarly ($L^3(x)$ is not shown in the figure).

Localization Algorithm. Suppose $x$ is corrupted into a file $x'$ by additive change. In this example, assume blocks 7, 8, 40 are changed. We compute a set $T \subseteq \{0, \dots, 63\}$ that contains 7, 8, 40 but whose size $|T|$ is not large. There are two stages. Initially, set $T_0 = \{0, \dots, 63\}$.

Stage one. Set $T_1 = T_0$. As for $x$, split $x'$ into $v + 1 = 4$ segments and do $\lambda$ coloring experiments using the same random string $f_{k_1}(\mu_1)$ as for $x$. In experiment $z = 1, \dots, \lambda$, compute $tag'_{0,1,c,z}$, $c \in \{C_0, \dots, C_3\}$ (or simply $\{⋆, ⋄, •, ◦\}$ without any confusion). For the coloring $z$ at row 1, since $tag'_{0,1,c,z} = tag_{0,1,c,z}$ for $c = C_1, C_3$, it follows that the segment lists ⋄ and ◦ under this coloring are both uncorrupted (as their tags are respectively consistent with the stored tags and as $h_\lambda$ is collision-resistant). So we can update $T_1 = T_1 - \{16, \dots, 31\} - \{48, \dots, 63\} = \{0, \dots, 15\} \cup \{32, \dots, 47\}$. By verifying $tag'_{0,1,C_0,z} \ne tag_{0,1,C_0,z}$, ⋆ contains a corruption and so we cannot remove ⋆ from the candidate corrupt set $T$. Similarly, • cannot be removed from $T$. Then we consider the shift $L^4(x')$ of $x'$, i.e., $(4, \dots, 63, 0, \dots, 3)$ (see row 2). Run coloring experiments to compute $tag'_{1,1,C_j,z}$, $j = 0, \dots, 3$. Since in the experiment at row 2, $tag'_{1,1,C_j,z} = tag_{1,1,C_j,z}$ for $j = 1, 2$, ⋆ and • are uncorrupted. We can update $T_1 = T_1 - \{20, \dots, 35\} - \{52, \dots, 63, 0, \dots, 3\} = \{4, \dots, 16\} \cup \{36, \dots, 47\}$. Similarly, we can consider the shift $L^8(x')$ of $x'$, i.e., $(8, \dots, 63, 0, \dots, 7)$ (see row 3) and update $T_1$ to $T_1 = \{4, \dots, 16\} \cup \{40, \dots, 47\}$. Then, $L^{12}(x')$ can be handled similarly (not shown in the figure) and is omitted.

Stage two. Set $T_2 = T_1$. Look at row 4 of Fig. 4. Split $x'$ into $(v+1)^2 = 16$ segments. Run coloring experiments to compute $tag'_{0,2,C_j,z}$, $j = 0, \dots, 3$. In our example, since $tag'_{0,2,C_0,z} = tag_{0,2,C_0,z}$, ⋆ is uncorrupted.
So we can update T2 = T2 − {0, · · · , 3} − {12, · · · , 15} − {32, · · · , 35} − {48, · · · , 51} = {4, · · · , 11} ∪ {16} ∪ {40, · · · , 47}.


Similarly, from $tag'_{1,2,C_1,z} = tag_{1,2,C_1,z}$, we can also update $T_2$ but the result happens to be unchanged. Next, consider the shift $L^1(x')$ of $x'$ (see row 5 in Fig. 4). Run coloring experiments to compute $tag'_{1,2,C_j,z}$, $j = 0, \dots, 3$. In our example, since $tag'_{1,2,C_j,z} = tag_{1,2,C_j,z}$ for $j = 1, 2$, we can update $T_2$ to $T_2 = \{5, \dots, 11\} \cup \{16\} \cup \{40\}$. In row 6, the segments containing blocks 7, 8, 40 are assigned to the same color (i.e., ⋄); the remaining colors are uncorrupted and can be removed from $T_2$. Thus, $T_2$ can be updated to $T_2 = \{6, 7, 8, 9, 40\}$. Finally, the localization factor here is $\alpha = |T_2|/3 \approx 1.67$.

FORMAL DESCRIPTION. The formal presentation of our keyed hash scheme can be found in Fig. 5. The properties of this scheme are shown in the following theorem.

Theorem 2. Let $\lambda, v, w$ be positive integers such that $v \ge 2$ and $n = (v+1)^w$, and define $\beta = n/2(v+1)$, and $\delta$ a function negligible in $\lambda$. Assume $H = \{H_\lambda\}_{\lambda \in \mathbb{N}}$ is a $(t_h, \epsilon_h)$-collision-resistant family of keyed hash functions from $\{0,1\}^{\lambda} \times \{0,1\}^{p_1(\lambda)} \to \{0,1\}^{\sigma}$ and $F = \{f_k\}_{|k| \in \mathbb{N}}$ is a $(t_f, \epsilon_f)$-pseudo-random

The algorithm CLKH: On input $k$, $x$, $|x| = n$, do the following:
- Randomly choose $h_\lambda$ from $H_\lambda$
- Write $k$ as $k = k_1|k_2|k_3$, randomly choose nonces $\mu_1, \mu_2$, and let $psr_1, psr_2$ be sufficiently long strings of pseudorandom bits obtained as $psr_i = f_{k_i}(\mu_i)$, for $i = 1, 2$;
- For $i = 1, \dots, w-1$, and $\ell = 0, \dots, v$,
    compute $x_{\ell,i}$ and $B_{\ell,i,0}, \dots, B_{\ell,i,(v+1)^i-1}$ as done in CLH1 (with the parameters $z$ and $y$ of CLH1 both set to $v + 1$)
    for $z = 1, \dots, \lambda$,
        for each $j = 0, \dots, (v+1)^i - 1$,
            randomly choose colour $c_{\ell,i,j,z} \in \{C_0, \dots, C_v\}$ and assign it to $B_{\ell,i,j}$ (using fresh pseudorandom bits from $psr_1$)
        for $c \in \{C_0, \dots, C_v\}$,
            let $S_{\ell,i,c,z}$ be the set of segments $B_{\ell,i,j}$ ($j \in \{0, \dots, (v+1)^i - 1\}$) with assigned colour $c_{\ell,i,j,z} = c$
            compute $tag_{\ell,i,c,z} = h_\lambda(k_3; S_{\ell,i,c,z}) \oplus psr_2$
            (Note: each $h_\lambda(k_3; S_{\ell,i,c,z})$ for different $(\ell, i, c, z)$ is masked by a fresh portion of $psr_2$.)
- Output: $tag = \{n, s, \mu_1, \mu_2, desc(h_\lambda), tag_{\ell,i,c,z} \mid 0 \le \ell \le v,\ 1 \le i < w,\ c \in \{C_0, \dots, C_v\},\ 1 \le z \le \lambda\}$.

The algorithm KLOC: On input $k$, $v$, $x'$, $tag$, where $k = k_1|k_2|k_3$, and $tag = \{n, s, \mu_1, \mu_2, desc(h_\lambda), tag_{\ell,i,c,z} \mid 0 \le \ell \le v,\ 1 \le i < w,\ c \in \{C_0, \dots, C_v\},\ 1 \le z \le \lambda\}$, do the following:
- Set $T_0 = \{0, \dots, n-1\}$ and compute $psr_1, psr_2$ as in CLKH;
- For $i = 1, \dots, w-1$,
    set $T_i = T_{i-1}$
    for $\ell = 0, \dots, v$, $c = C_0, \dots, C_v$, and $z = 1, \dots, \lambda$,
        compute $S'_{\ell,i,c,z}$ from $x'$ as done for $S_{\ell,i,c,z}$ from $x$ in CLKH above
        let $I_{\ell,i,c,z}$ be the set of indices from all segments in $S'_{\ell,i,c,z}$
        if $psr_2 \oplus h_\lambda(k_3; S'_{\ell,i,c,z}) = tag_{\ell,i,c,z}$ then set $T_i = T_i \setminus I_{\ell,i,c,z}$
- Output: $T_{w-1}$.

Fig. 5. The Corruption-Localizing Keyed Hash Scheme KHS.

family of functions. Then the scheme in Fig. 5 is a $(t', \epsilon', \beta, v)$-corruption-localizing keyed hash scheme KHS, where $\epsilon' \le \epsilon_h + \epsilon_f + \delta$ and $t' \le t_f + t_h + O(t_n(H) \cdot (v+1)^2 \log_{v+1} n)$, where $t_n(H)$ is the running time of any keyed hash function from $H_\lambda$ on inputs of $n$ blocks. Moreover, KHS has localization factor $\alpha = (v+1)^2 v$, storage complexity $\tau = O(\log n + \sigma (v+1)^2 \lambda \log_{v+1} n + |desc(H)|)$, and runtime complexity $\rho = O(t_f + t_n(H) \cdot v^2 \lambda \log_{v+1} n)$, where $|desc(H)|$ is an upper bound on the description size of any hash function from $H_\lambda$ and $\lambda = O(\log^{1+\epsilon} n)$ for any $\epsilon > 0$.

Remarks and proof idea. We note that if $v = O(1)$, scheme KHS can localize $v$ corruptions with a constant localization factor and polylogarithmic (in $n$) storage complexity. We also note that an active adversary could observe which blocks are being re-sent and then infer the coloring and build more efficient attacks. However, the honest parties share a key and can thus encrypt their communication and pad it to the upper bound on the localization factor so as not to reveal even how many blocks are being re-sent. Now we outline the proof idea for Theorem 2. As $\rho$ and $\tau$ can be checked by calculation, we only need to consider the localization factor $\alpha$. Obviously, $T_i$ is related to the size of each corrupted segment $S_a$. Let $i_0$ be such that each $|S_a| \le n (v+1)^{-i_0-1}$ but some $|S_a| \ge n (v+1)^{-i_0-2}$. If we are able to show that $|T_i| \le vn \cdot (v+1)^{-i}$ for $0 \le i \le i_0$, then $|T_{w-1}| \le |T_{i_0}| \le vn(v+1)^{-i_0} \le n(v+1)^{-i_0-2} \cdot v(v+1)^2 \le (v+1)^2 v \sum_a |S_a|$, which gives the constant localization factor $(v+1)^2 v$. So we focus on proving $|T_i| \le vn \cdot (v+1)^{-i}$ for $i \le i_0$. Instead of a rigorous proof, we demonstrate the technical idea using the example in Figure 4, where the corrupted
segments are $S_1 = \{7, 8\}$ and $S_2 = \{40\}$. Consider rows 1 and 2 in Figure 4. As in the proof idea for the HS scheme, one of $L^{4i}(x)$ for $i = 0, 1, 2, 3$ has at most $\lfloor v(y+1)/y \rfloor = \lfloor 2(v+2)/(v+1) \rfloor = 2$ corrupted segments. In our example, $x$ has 2 corrupted segments $SB_1, SB_3$ (see row 1). If there is a coloring $z$ such that $SB_1, SB_3$ are assigned to the same color and $SB_2, SB_4$ are assigned to other color(s), then $SB_2$ and $SB_4$ are uncorrupted and can be removed from $T_1$. This occurs with probability $1/4 \cdot (3/4)^2$. Since we have $\lambda$ coloring experiments, this event fails to occur only with negligible probability. In our row 1, $SB_1, SB_3$ are assigned to color ⋆, while $SB_2$ is assigned to color ⋄ and $SB_4$ is assigned to color ◦. Therefore, $|T_1| \le 2(v+1)^{w-1} \le vn \cdot (v+1)^{-1}$. So the claim holds for $i = 1$. In iteration two, $x$ is divided into $(v+1)^2 = 16$ segments. Again, similarly to the proof idea for the HS scheme, there is an $i$ such that $L^i(x)$ has at most $\lfloor v(y+1)/y \rfloor = \lfloor 2(v+2)/(v+1) \rfloor = 2$ corrupted segments. In our example, $L^1(x)$ in row 5 has this property. $T_1$ intersects at most $2(v+1) + 2 = 10$ segments of $L^1(x)$. In our example, it is exactly 10 segments. By our assumption, among these 10 segments, two are corrupted and the remaining ones are uncorrupted. In our example, $SB_2$ and $SB_{10}$ are corrupted. If in some experiment we can color these two with one color and the remaining 8 with other colors, then $T_2 \subseteq SB_2 \cup SB_{10}$ and thus $|T_2| \le 2(v+1)^{w-2} \le vn \cdot (v+1)^{-2}$. The conclusion holds again. Such a coloring occurs with probability $1/4 \cdot (3/4)^8$. Since there are $\lambda$ colorings, this desired coloring fails to occur only with exponentially small probability. The formal proof of the theorem carefully generalizes the idea in this description (see Appendix C for details).

Acknowledgements. Jiang's work was mainly done at U. of Calgary and is now supported by National 863 High Tech Plan (No. 2006AA01Z428), NSFC (No. 60673075) and UESTC Young Faculty Plans.

References

1. Bellare, M., Canetti R., and Krawczyk, H. Keying Hash Functions for Message Authentication. In Advances in Cryptology - CRYPTO' 96, pages 1-15, LNCS, Springer-Verlag.
2. Blaze, M. A Cryptographic File System for UNIX. In Proc. of 1993 ACM Conference on Computer and Communications and Security.
3. Blum, M. and Kannan S. Designing Programs That Check Their Work. In Proc. of the 1989 ACM Symposium on Theory on Computing.
4. Blum, M., Evans, W., Gemmell, P., Kannan S., and Naor, M. Checking the Correctness of Memories. In Proc. of the 1995 IEEE Symposium on Foundations on Computer Science.
5. Cattaneo, G., Catuogno, L., Del Sorbo, A., and Persiano, G. The Design and Implementation of a Cryptographic File System for UNIX. In Proc. of 2001 USENIX Annual Technical Conference.
6. Damgard, I. Collision Free Hash Functions and Public Key Signature Schemes. In Advances in Cryptology - EUROCRYPT' 87, pages 203-216, LNCS, Springer-Verlag.
7. Di Crescenzo, G., Ghosh, A., and Talpade, R. Towards a Theory of Intrusion Detection. In Computer Security - ESORICS 2005, Proc. of 10th European Symposium on Research in Computer Security, LNCS 3679, Springer-Verlag.
8. Di Crescenzo G., and Vakil, F. Cryptographic hashing for Virus Localization. In Proc. of the 2006 ACM CCS Workshop on Rapid Malcode.
9. Du D. and Hwang, F. Combinatorial Group Testing and its Applications. World Scientific Publishing Company, May 2000.
10. Ghosh, A. and Swaminatha, T. Software security and privacy risks in mobile e-commerce. In Communications of the ACM, vol. 44, n. 2, pp. 51-57, 2001.
11. Goldreich, O., Goldwasser, S., and Micali, S. How to Construct Random Functions. In Journal of the ACM, Vol. 33, No. 4, 1986.
12. Kim, G., and Spafford, E. The design and implementation of tripwire: a file system integrity checker. In Proc. of 1994 ACM Conference on Computer and Communications Security.
13. Merkle, R. A Certified Digital Signature. In Advances in Cryptology - CRYPTO' 89, LNCS, Springer-Verlag.
14. NIST. Secure hash standard. Federal Information Processing Standard, FIPS-180-1, April 1995.
15. NIST. Secure Hash Signature Standard (SHS) (FIPS PUB 180-2). United States of America, Federal Information Processing Standard (FIPS) 180-2, 2002 August 1.
16. NIST. Cryptographic Hash Algorithm Competition. http://csrc.nist.gov/groups/ST/hash/sha-3/index.html
17. Oprea, A., Reiter, M., and Yang, K. Space-Efficient Block Storage Integrity. In Proc. of 2005 Network and Distributed System Security Symposium.
18. Rivest, R. The MD5 Message-Digest Algorithm. Request for Comments (RFC 1320). Internet Activities Board, Internet Privacy Task Force, April 1992.
19. Russell, A. Necessary and Sufficient Conditions for Collision-Free Hashing. In Journal of Cryptology, vol. 8, n. 2, 1995.
20. Skoudis, E. MALWARE: Fighting Malicious Code. Prentice Hall, 2004.
21. Szor, P. The Art of Computer Virus Research and Defense. Addison Wesley, 2005.
22. Stalling, W. and Brown, L. Computer Security: Theory and Practice. Prentice Hall, 2007.
23. Sivathanu, G., Wright, C., and Zadok, E. Ensuring Data Integrity in Storage: Techniques and Applications. In Proc. of the 2005 ACM International Workshop on Storage Security and Survivability.
24. 1st NIST Cryptographic Hash Functions Workshop. Program at http://www.csrc.nist.gov/pki/HashWorkshop/2005/program.htm

A Definitions and preliminaries

Basic definitions. For any (possibly probabilistic) algorithm $A$, an oracle algorithm is denoted as $A^O$, where $O$ is an (oracle) function, and the notation $a \leftarrow A(x, y, z, \dots)$ denotes the random process that runs algorithm $A$ on input $x, y, z, \dots$, and denotes the resulting output as $a$. Let $H = \{H_\lambda\}_{\lambda \in \mathbb{N}}$, where $H_\lambda$ is a set of functions $h_\lambda : \{0,1\}^{p_1(\lambda)} \to \{0,1\}^{p_2(\lambda)}$. A family of hash functions, denoted as $H$, is a family of polynomial-time (in $\lambda$) samplable and computable functions $h_\lambda$ that take a $p_1(\lambda)$-bit input, also called message, and return a $p_2(\lambda)$-bit output, also called tag, where $p_1, p_2$ are polynomials and $\lambda$ is a security parameter. (In practical applications, $p_2$ is set to be much smaller than $p_1$.) Let $H = \{H_\lambda\}_{\lambda \in \mathbb{N}}$, where $H_\lambda$ is a set of functions $h_\lambda : \{0,1\}^{\lambda} \times \{0,1\}^{p_1(\lambda)} \to \{0,1\}^{p_2(\lambda)}$. A family of keyed hash functions, denoted as $H$, is a family of polynomial-time (in $\lambda$) samplable and computable functions $h_\lambda$ that take as input a $\lambda$-bit key and a $p_1(\lambda)$-bit message, and return a $p_2(\lambda)$-bit output, also called tag, where $p_1, p_2$ are polynomials. As the security parameter $\lambda$ and the key $k$ are often clear from the context, we use the term $h$ to denote a hash function with security parameter $\lambda$ and the notation $h_k$ to denote a keyed hash function with security parameter $\lambda$ taking key $k$ as input.

Collision-Resistant Hash Functions. We consider families of hash functions that compress an arbitrarily large input to a fixed-size output. In general each output has a very large set of inputs that are mapped to it, yet the collision-resistance property states that any efficient algorithm will find two preimages for the same output only with some small probability. We now recall their formal definition.

Definition 2. Let $H = \{H_\lambda\}_{\lambda \in \mathbb{N}}$ be a family of hash functions.
For any $t, \epsilon > 0$, we say that $H$ is $(t, \epsilon)$-collision-resistant if for any algorithm $A$ running in time $t$, $\mathrm{Prob}[\, h_\lambda \leftarrow H_\lambda;\ (x_1, x_2) \leftarrow A(1^\lambda, h_\lambda) : x_1 \ne x_2 \wedge h_\lambda(x_1) = h_\lambda(x_2) \,] \le \epsilon$. We also say that $H$ is collision-resistant if it is $(t, \epsilon)$-collision-resistant for $t$ polynomial in $\lambda$ and $\epsilon$ negligible in $\lambda$.

Several constructions of collision-intractable families of hash functions have been given in the literature, both assuming the computational intractability of number-theoretic problems (see, e.g., [6]), and assuming complexity-theoretic assumptions (see, e.g., [19]). Moreover, researchers also studied heuristic constructions, such as MD5, SHA1, SHA2 [18, 14, 15], which, although not proved to be collision-resistant, are much more efficient and currently used by real-life cryptographic systems and products. In the light of recent cryptanalysis results on previous heuristic proposals, researchers have proposed a number of new functions (see, e.g., [24]) and a request for standard submission was recently issued [16]. Although the security properties
of these functions cannot be formally proved, many of them follow one paradigm [13, 6] which guarantees their collision resistance over inputs of arbitrary length starting from the collision resistance over inputs of fixed length of an atomic compression function.

Keyed Hash Functions. Starting with [1], researchers have considered an important variant of hash functions: keyed hash functions. In particular, the collision-resistance property of keyed hash functions (including those built on top of the above mentioned heuristic constructions) can be proved under more rigorously defined assumptions on their atomic compression functions. For such functions, the collision-resistance property states that all efficient algorithms that can make queries to the keyed hashing function can find two preimages for the same output only with small probability. We now recall their formal definition.

Definition 3. Let $H = \{H_\lambda\}_{\lambda \in \mathbb{N}}$ be a family of keyed hash functions. For any $t, q, \epsilon > 0$, we say that $H$ is $(t, q, \epsilon)$-collision-resistant if for any oracle algorithm $A$ running in time $t$ and making at most $q$ oracle queries,

$$\Pr[\, h_\lambda \leftarrow H_\lambda;\ k \leftarrow \{0,1\}^{\lambda};\ (x_1, x_2) \leftarrow A^{h_\lambda(k;\cdot)}(1^\lambda) : x_1 \ne x_2 \wedge h_\lambda(k; x_1) = h_\lambda(k; x_2) \,]$$

is at most $\epsilon$. We also say that $H$ is collision-resistant if it is $(t, q, \epsilon)$-collision-resistant for $t, q$ polynomial in $\lambda$ and $\epsilon$ negligible in $\lambda$.

Pseudo-Random Functions. First studied in [11], pseudo-random functions play a fundamental role in security and cryptography and have found numerous applications throughout. Formally, consider a family of functions $F = \{f_k\}_{\lambda \in \mathbb{N}}$, where the key $k$ is of size $|k| = \lambda$ and is randomly chosen, and function $f_k : \{0,1\}^{p_1(\lambda)} \to \{0,1\}^{p_2(\lambda)}$ is computable in polynomial time (in $\lambda$). Let $r : \{0,1\}^{p_1(\lambda)} \to \{0,1\}^{p_2(\lambda)}$ be a random function. We say that the family of functions $F$ is $(t, q, \epsilon)$-pseudorandom if for any polynomial-time oracle algorithm $A$ running in time $t$ and making at most $q$ oracle queries, the probability that $A$ returns 1 when its oracle is $f_k$ minus the probability that $A$ returns 1 when its oracle is $r$, is at most $\epsilon$.
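To make the role of the time bound $t$ and the success probability $\epsilon$ in Definitions 2 and 3 concrete, the following sketch runs a birthday search against SHA-256 truncated to a few bits. The truncation is an assumption made purely so that the search finishes quickly; it says nothing about the full-length function, and the name `find_collision` is hypothetical.

```python
import hashlib

def find_collision(bits=16):
    """Birthday search for a collision of SHA-256 truncated to `bits` bits.

    Returns two distinct messages with equal truncated digests and the number
    of hash evaluations used. By the pigeonhole principle the loop ends after
    at most 2**bits + 1 evaluations.
    """
    seen = {}
    counter = 0
    while True:
        msg = counter.to_bytes(8, "big")
        digest = int.from_bytes(hashlib.sha256(msg).digest(), "big") >> (256 - bits)
        if digest in seen:
            return seen[digest], msg, counter + 1
        seen[digest] = msg
        counter += 1

m1, m2, evaluations = find_collision(16)
print(m1.hex(), m2.hex(), evaluations)
```

With a $\sigma$-bit output, roughly $2^{\sigma/2}$ evaluations are expected to suffice for such a search, which is why the achievable $\epsilon$ for a given $t$ depends on the tag length $\sigma$.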

B Proof of Theorem 1

The proof of Theorem 1 is divided into two parts, as the adversary can be successful in two ways: either by preventing effective localization (i.e., one of the modified blocks is not included in $T$), or by forcing the scheme to exceed the expected localization factor (i.e., the size of $T$ is larger than $\alpha$ times $\mathrm{Diff}_v[x, x']$). We show that the first case violates the collision intractability of $H$, and that the second case happens with probability 0, for the value of $\alpha$ provided in the theorem.

PROVING EFFECTIVE LOCALIZATION. Let $Coll(A)$ be the event defined as follows: "For some $\ell \in \{0, \dots, y-1\}$, $i \in \{1, \dots, w-1\}$, and $j \in \{0, \dots, z-1\}$, it holds that $S_{\ell,i,j} \ne S'_{\ell,i,j}$ and $h_\lambda(S_{\ell,i,j}) = h_\lambda(S'_{\ell,i,j})$, where $S_{\ell,i,j}, S'_{\ell,i,j}$ are the two segment lists obtained when running CLH on the inputs $x, x'$ returned by $A$, respectively." We can then write

$$\epsilon' \le \mathrm{Prob}[\, Coll(A) \,] + \mathrm{Prob}[\, \mathrm{HExp}_{HS,A,hash}(\alpha, v) = 1 \mid \overline{Coll(A)} \,].$$

In this first part of the proof we show that $\mathrm{Prob}[\, Coll(A) \,] \le \epsilon$ under the assumption that $H$ is $(t, \epsilon)$-collision-resistant. (Later, we show that $\mathrm{Prob}[\, \mathrm{HExp}_{HS,A,hash}(\alpha, v) = 1 \mid \overline{Coll(A)} \,] = 0$ for the value $\alpha$ claimed in the theorem, from which the theorem follows.)

Bounding the collision probability: We use $A$ to construct an algorithm $A'$ such that, if event $Coll(A)$ happens, then $A'$ violates the collision-intractability of $H$. On input a hash function $h_\lambda$ randomly chosen from $H_\lambda$, algorithm $A'$ runs the following steps:
1. Let $(x, x') = A(\alpha, v)$
2. For $i = 1, \dots, w-1$, and $\ell = 0, \dots, y-1$,
     set $s = \ell \cdot z^{w-i}/y$, and compute $x_{\ell,i} = L^s(x)$ and $x'_{\ell,i} = L^s(x')$
     split $x_{\ell,i}$ into segments $B_{\ell,i,0} \| \cdots \| B_{\ell,i,z^i-1}$ of equal length
     split $x'_{\ell,i}$ into segments $B'_{\ell,i,0} \| \cdots \| B'_{\ell,i,z^i-1}$ of equal length
     for $j = 0, \dots, z-1$,
         compute segment list $S_{\ell,i,j} = (B_{\ell,i,j} \| B_{\ell,i,j+z} \| \cdots \| B_{\ell,i,j+z^i-z})$
         compute segment list $S'_{\ell,i,j} = (B'_{\ell,i,j} \| B'_{\ell,i,j+z} \| \cdots \| B'_{\ell,i,j+z^i-z})$
         if $S_{\ell,i,j} \ne S'_{\ell,i,j}$ and $h_\lambda(S_{\ell,i,j}) = h_\lambda(S'_{\ell,i,j})$ then return: $(S_{\ell,i,j}, S'_{\ell,i,j})$.
3. return: $\perp$.

We see that $A'$ obtains $A$'s output and then runs the same steps as CLH1 on input $x$ (resp., $x'$) to obtain the segment lists $S_{\ell,i,j}$ (resp., $S'_{\ell,i,j}$). Thus, we obtain that $\epsilon' = \epsilon$ and $t' = t + O(t_n(H) \cdot yz \log_z n)$, where $t_n(H)$ is the maximum running time of any function from $H_\lambda$ on inputs of length $n$. Moreover, by observing that algorithm LOC in HS only removes blocks from the current localizing subset $T_i$, for $i = 1, \dots, w-1$, in correspondence with an equality $h_\lambda(S'_{\ell,i,j}) = h_\lambda(S_{\ell,i,j})$, we obtain the following lemma, which implies effective localization of HS.

Lemma 1. If event $Coll(A)$ does not happen then, for $i = 1, \dots, w-1$, it holds that $x[\overline{T_i}] = x'[\overline{T_i}]$, where $\overline{T_i}$ denotes the complement of $T_i$ in $\{0, \dots, n-1\}$.

CORRECTNESS OF LOCALIZATION FACTOR. We now assume that event $Coll(A)$ does not happen; that is: for all $\ell, i, j$, and any two segment lists $S_{\ell,i,j}, S'_{\ell,i,j}$ obtained when running CLH on the inputs $x, x'$ returned by $A$, respectively, it holds that $S_{\ell,i,j} = S'_{\ell,i,j}$ or $h_\lambda(S_{\ell,i,j}) \ne h_\lambda(S'_{\ell,i,j})$. We show $\mathrm{Prob}[\, \mathrm{HExp}_{HS,A,hash}(\alpha, v) = 1 \mid \overline{Coll(A)} \,] = 0$ for the value $\alpha$ claimed in the theorem, from which the theorem follows. Towards this goal, we prove the following lemma.

Lemma 2. Let $z, v, y$ be positive integers such that $v < zy(y+1)^{-1}$ and $y \mid z$, assume $y > 1$ and let $n = z^w$. Also, let $x, x'$ be two $n$-block inputs and $tag = CLH1(x)$. Let $T_0, \dots, T_{w-1}$ be the sets computed by algorithm LOC1 when run on input $(v, x', tag)$. We have that if $\mathrm{Diff}_v[x, x'] \le n z^{-i} y^{-1}$ for some $i$, then $\{j \mid x[j] \ne x'[j],\ j = 0, \dots, n-1\} \subseteq T_i$ and $|T_i| \le \lfloor v(y+1)/y \rfloor^{i} z^{w-i}$.

Proof. We prove the lemma by induction over $i = 0, \dots, w-1$. When $i = 0$, $T_0 = \{0, 1, \dots, n-1\}$. The conclusion follows trivially. We now consider $i > 0$ such that $\mathrm{Diff}_v[x, x'] \le n z^{-i} y^{-1}$. In this case, assume $S_1, \dots, S_v$ achieve $\mathrm{Diff}_v[x, x']$. Then $|S_t| \le n z^{-i} y^{-1}$ for $t = 1, \dots, v$. We observe that, for each $a$, at most one $\ell$ satisfies the following property (♠): $\exists j_1 < j_2$, $S_a \cap B_{\ell,i,j_1} \ne \emptyset$ and $S_a \cap B_{\ell,i,j_2} \ne \emptyset$. (Remark: since $S_a$ is a segment, we can assume $j_2 = j_1 + 1$.) Otherwise, we can assume $\exists\, (\ell, j)$ and $(\ell', j')$ ($j < j'$) such that

$$S_a \cap B_{\ell,i,j} \ne \emptyset, \quad S_a \cap B_{\ell,i,j+1} \ne \emptyset, \quad S_a \cap B_{\ell',i,j'} \ne \emptyset, \quad S_a \cap B_{\ell',i,j'+1} \ne \emptyset. \qquad (4)$$

The first two inequalities imply that index $t_1 := \ell \cdot z^{w-i} y^{-1} + (j+1) z^{w-i} - 1 \in S_a$, and the last two inequalities imply that $t_2 := \ell' \cdot z^{w-i} y^{-1} + (j'+1) z^{w-i} \in S_a$. Note that $t_2 - t_1 \ge z^{w-i} + (\ell' - \ell) z^{w-i} y^{-1} + 1 \ge z^{w-i} y^{-1} + 1 = n z^{-i} y^{-1} + 1$. Thus, $t_1 < t_2$ and $|\{t_1, t_1+1, \ldots, t_2\}| > |S_a|$. On the other hand, since $S_a$ is a single segment, it follows that $\{t_1, t_1+1, \ldots, t_2\} \subseteq S_a$, which is a contradiction as $|S_a| \le n z^{-i} y^{-1}$. Thus, for each $a$ (given $i$), the value $\ell$ that satisfies property (♠) is unique. Denote this special $\ell$ (if it exists) as $\ell_a$. Then, for any $\ell \ne \ell_a$, there is only one block $B_{\ell,i,j}$ (over $j$) such that $S_a \cap B_{\ell,i,j} \ne \emptyset$ (the uniqueness implies

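$S_a \subseteq B_{\ell,i,j}$ for this special $j$). These two facts imply that each $S_a$ intersects with at most $y+1$ blocks $B_{\ell,i,j}$ (over $\ell, j$) and thus intersects with at most $y+1$ segment lists $S_{\ell,i,u}$ (over $\ell, u$), since each $B_{\ell,i,j}$ is assigned to a unique $S_{\ell,i,u}$. Thus, for given $i$, there are at most $v(y+1)$ segment lists $S_{\ell,i,u}$ (over $\ell, u$) that are intersected by some $S_a$.

Now define $N_\ell = \{j \mid S_{\ell,i,j} \cap (\cup_{a=1}^{v} S_a) \ne \emptyset,\ j = 0, \ldots, z-1\}$. Then we have $\sum_{\ell=0}^{y-1} |N_\ell| \le v(y+1)$. Thus, there exists $\ell_i \in \{0, \ldots, y-1\}$ such that $|N_{\ell_i}| \le \lfloor v(y+1)/y \rfloor$. Since for any $j \notin N_{\ell_i}$ we have $S_{\ell_i,i,j} \cap (\cup_{a=1}^{v} S_a) = \emptyset$, it follows that

$$T_i \subseteq T_{i-1} \setminus \Big( \bigcup_{j \notin N_{\ell_i}} \bigcup_{t=0}^{z^{i-1}-1} I_{\ell_i,i,j+zt} \Big) = T_{i-1} \cap \Big( \bigcup_{j \in N_{\ell_i}} \bigcup_{t=0}^{z^{i-1}-1} I_{\ell_i,i,j+zt} \Big).$$

Iteratively applying this relation, we have

$$T_i \subseteq T_0 \cap \Big( \bigcup_{j \in N_{\ell_1}} \bigcup_{t=0}^{z^{1-1}-1} I_{\ell_1,1,j+zt} \Big) \cap \cdots \cap \Big( \bigcup_{j \in N_{\ell_i}} \bigcup_{t=0}^{z^{i-1}-1} I_{\ell_i,i,j+zt} \Big)$$
$$= \Big( \bigcup_{j \in N_{\ell_1}} \bigcup_{t=0}^{z^{1-1}-1} I_{\ell_1,1,j+zt} \Big) \cap \cdots \cap \Big( \bigcup_{j \in N_{\ell_i}} \bigcup_{t=0}^{z^{i-1}-1} I_{\ell_i,i,j+zt} \Big)$$
$$= \bigcup_{j_1 \in N_{\ell_1}} \cdots \bigcup_{j_i \in N_{\ell_i}} \Big\{ \big( \cup_{t_1} I_{\ell_1,1,j_1+zt_1} \big) \cap \cdots \cap \big( \cup_{t_i} I_{\ell_i,i,j_i+zt_i} \big) \Big\}.$$

Note that, given $j \in \{0, \ldots, z-1\}$, an index $g \in \cup_{t=0}^{z^{i-1}-1} I_{\ell,i,j+zt}$ if and only if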

$$g \equiv \ell \cdot z^{w-i} y^{-1} + j \cdot z^{w-i} + \gamma \pmod{z^{w-i+1}} \quad \text{for some } \gamma \in \{0, \ldots, z^{w-i}-1\}.$$
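The congruence characterization above can be checked by brute force for small parameters. The following is a minimal sketch, assuming that $L_s$ denotes a cyclic left shift by $s$ blocks (so that position $p$ of the shifted input holds original block $(p+s) \bmod n$); the function names are illustrative and do not come from the paper.

```python
def block_indices(n, z, y, w, l, i, j):
    """Original indices covered by block B_{l,i,j}.

    Assumption: L_s is a cyclic left shift by s = l * z^(w-i) / y blocks.
    """
    seg_len = z ** (w - i)          # each of the z^i segments has this length
    s = l * seg_len // y            # exact because y divides z and w - i >= 1
    return {(s + j * seg_len + gamma) % n for gamma in range(seg_len)}


def segment_list_indices(n, z, y, w, l, i, j):
    """Union over t of I_{l,i,j+zt}: the indices hashed in segment list S_{l,i,j}."""
    covered = set()
    for t in range(z ** (i - 1)):
        covered |= block_indices(n, z, y, w, l, i, j + z * t)
    return covered


def congruence_indices(n, z, y, w, l, i, j):
    """Indices g with g = l*z^(w-i)/y + j*z^(w-i) + gamma (mod z^(w-i+1))."""
    seg_len = z ** (w - i)
    modulus = z * seg_len
    base = l * seg_len // y + j * seg_len
    # gamma ranges over {0, ..., seg_len - 1}, so test the residue of g - base
    return {g for g in range(n) if (g - base) % modulus < seg_len}
```

For example, with $z = 4$, $y = 2$, $w = 3$ (so $n = 64$), the two index sets returned by `segment_list_indices` and `congruence_indices` can be compared for every admissible $(\ell, i, j)$.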

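Therefore, $g \in (\cup_{t_1} I_{\ell_1,1,j_1+zt_1}) \cap \cdots \cap (\cup_{t_i} I_{\ell_i,i,j_i+zt_i})$ is equivalent to

$$\begin{aligned}
g &\equiv \ell_1 \cdot z^{w-1} y^{-1} + j_1 z^{w-1} + \gamma_1 \pmod{z^{w-1+1}} && \text{for some } \gamma_1 \in \{0, \ldots, z^{w-1}-1\} \\
&\ \ \vdots \\
g &\equiv \ell_i \cdot z^{w-i} y^{-1} + j_i z^{w-i} + \gamma_i \pmod{z^{w-i+1}} && \text{for some } \gamma_i \in \{0, \ldots, z^{w-i}-1\}.
\end{aligned} \qquad (\ast)$$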
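For small parameters, the number of indices $g \in \{0, \ldots, n-1\}$ that satisfy system $(\ast)$ can be counted directly. The sketch below is only an illustration under the stated parameter constraints ($y \mid z$, $n = z^w$); `count_solutions` is a hypothetical helper, and each congruence is tested as a range condition on the residue of $g$.

```python
def count_solutions(z, y, w, ls, js):
    """Count g in {0, ..., z^w - 1} satisfying the first len(ls) congruences of (*).

    ls[k-1] and js[k-1] play the roles of l_k and j_k for level k = 1, ..., i.
    """
    n = z ** w
    count = 0
    for g in range(n):
        satisfied = True
        for k, (l, j) in enumerate(zip(ls, js), start=1):
            seg_len = z ** (w - k)                  # range of gamma_k
            base = l * seg_len // y + j * seg_len   # l_k * z^(w-k) / y + j_k * z^(w-k)
            if (g - base) % (z * seg_len) >= seg_len:
                satisfied = False
                break
        if satisfied:
            count += 1
    return count
```

With $z = 4$, $y = 2$, $w = 3$ and $i = 2$, every choice of $(\ell_1, \ell_2)$ and $(j_1, j_2)$ can be enumerated and the count compared with $z^{w-i} = 4$.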

Now we show that $(\ast)$ only has $z^{w-i}$ solutions for $g$. To do this, we show that for a fixed $\gamma_i \in \{0, \ldots, z^{w-i}-1\}$, the solution for $g$ is unique. Indeed, when fixing $\gamma_i$, there exists a unique $\gamma'_{i-1} \in \{0, \ldots, z^{w-(i-1)}-1\}$ such that $\gamma'_{i-1} \equiv \ell_i \cdot z^{w-i} y^{-1} + j_i \cdot z^{w-i} + \gamma_i \pmod{z^{w-i+1}}$. Since $g \equiv \ell_{i-1} \cdot z^{w-i+1} y^{-1} + j_{i-1} \cdot z^{w-i+1} + \gamma_{i-1} \pmod{z^{w-i+2}}$, it follows that $g \equiv \ell_{i-1} \cdot z^{w-i+1} y^{-1} + j_{i-1} \cdot z^{w-i+1} + \gamma_{i-1} \pmod{z^{w-i+1}}$. That is, $g \equiv \ell_{i-1} \cdot z^{w-i+1} y^{-1} + \gamma_{i-1} \pmod{z^{w-i+1}}$. Thus, $\gamma_{i-1} = \gamma'_{i-1} - \ell_{i-1} \cdot z^{w-i+1} y^{-1} \pmod{z^{w-i+1}}$ is determined, because $\gamma_{i-1} \in \{0, \ldots, z^{w-i+1}-1\}$. Iteratively applying this reasoning, we have that $\gamma_{i-2}, \ldots, \gamma_1$ are all determined. Thus, $g$ is determined, and $(\ast)$ has exactly $z^{w-i}$ solutions.

Now we come back to bound $|T_i|$. We have $|T_i| \le |N_{\ell_1}| \cdots |N_{\ell_i}| \, z^{w-i} \le \lfloor v(y+1)/y \rfloor^i z^{w-i}$. This completes our proof. ⋄

Concluding the proof of Theorem 1. Let $i$ be the maximum number such that $|S_{a'}| \le n z^{-i} y^{-1}$ for some $a' \in \{1, \ldots, v\}$, where $S_j$, $j = 1, \ldots, v$, is as in Lemma 2. By Lemma 2,

$$|T_{w-1}| \le |T_i| \le \lfloor v(y+1)/y \rfloor^i z^{w-i} \le \lfloor v(y+1)/y \rfloor^i \, zy \, |S_{a'}|. \qquad (5)$$

Hence, $|T_{w-1}| \le \lfloor v(y+1)/y \rfloor^i \, zy \sum_{a=1}^{v} |S_a| \le \lfloor v(y+1)/y \rfloor^{w-1} zy \sum_{a=1}^{v} |S_a|$. Since $\lfloor v(y+1)/y \rfloor^{-1} zy \cdot n^{\log_z \lfloor v(y+1)/y \rfloor} = \alpha$, the probability that $\mathrm{HExp}_{HS,A,hash}(\alpha, v)$ returns 1 in step 5 (i.e., because $|T| > \alpha \cdot \mathrm{Diff}_v[x, x']$) is 0. Since $\log_z \lfloor v(y+1)/y \rfloor < 1$, HS has a sub-linear localization factor. This concludes the proof of the localization property of HS and thus the proof of Theorem 1.
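As a numeric illustration of the localization factor just derived, the sketch below evaluates $\alpha$ both as $\lfloor v(y+1)/y \rfloor^{w-1} zy$ and through the closed form $\lfloor v(y+1)/y \rfloor^{-1} zy \cdot n^{\log_z \lfloor v(y+1)/y \rfloor}$. The function names and the sample parameters are illustrative assumptions, not values taken from the paper.

```python
import math


def alpha_exact(z, y, v, w):
    """alpha = floor(v(y+1)/y)^(w-1) * z * y, computed with integers."""
    if y <= 1 or z % y != 0 or v * (y + 1) >= z * y:
        raise ValueError("need y > 1, y | z and v < zy/(y+1)")
    r = (v * (y + 1)) // y
    return r ** (w - 1) * z * y


def alpha_closed_form(z, y, v, w):
    """alpha = (zy / r) * n^(log_z r) with r = floor(v(y+1)/y) and n = z^w."""
    r = (v * (y + 1)) // y
    n = z ** w
    return (z * y / r) * n ** math.log(r, z)
```

For instance, with $z = 16$, $y = 4$, $v = 3$ and $w = 4$ (so $n = 65536$), both functions agree up to floating-point error, and the resulting $\alpha$ is much smaller than $n$.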

C Proof of Theorem 2

We start the proof by defining a variation of the experiment $\mathrm{KExp}_{KHS,A,keyh}$, denoted as $\mathrm{KExp}_{KHS,A,keyh,\sim}$, and formally defined as equal to $\mathrm{KExp}_{KHS,A,keyh}$, except that the pseudo-random strings returned by the pseudo-random functions $f_{k_1}, f_{k_2}$ are replaced by random and independent strings of the same length. As for scheme HS, we formally define a collision event, denoted as $\mathrm{Coll}_\sim(A)$, as follows: "For some $\ell \in \{0, \ldots, y-1\}$, $i \in \{1, \ldots, w-1\}$, and $j \in \{0, \ldots, z-1\}$, it holds that $S_{\ell,i,j} \ne S'_{\ell,i,j}$ and $h_\lambda(k_3; S_{\ell,i,j}) = h_\lambda(k_3; S'_{\ell,i,j})$, where $S_{\ell,i,j}, S'_{\ell,i,j}$ are two segment lists obtained when running CLKH (using random instead of pseudo-random strings) on input $x, x'$ returned by $A$, respectively." We can then write

$$\epsilon' \le \mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh}(\alpha, v) = 1\,] - \mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v) = 1\,] + \mathrm{Prob}[\,\mathrm{Coll}_\sim(A)\,] + \mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v) = 1 \mid \overline{\mathrm{Coll}_\sim(A)}\,].$$

The rest of the proof is divided into three parts. First, we observe that the difference $\mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh}(\alpha, v) = 1\,] - \mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v) = 1\,]$ is upper bounded by $2\epsilon_f$. Second, similarly to the proof for HS, $\mathrm{Prob}[\,\mathrm{Coll}_\sim(A)\,] \le \epsilon_h$. Finally, we show that $\mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v) = 1 \mid \overline{\mathrm{Coll}_\sim(A)}\,] = 0$ for the value $\alpha$ claimed in the theorem. As in the proof of Theorem 1, these facts imply that scheme KHS has effective localization and has localization factor $\alpha$, from which the theorem follows.

The first upper bound. We would like to compute an upper bound on the difference $\mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh}(\alpha, v) = 1\,] - \mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v) = 1\,]$. Since the two experiments only differ in the computation of the strings $psr_1$ and $psr_2$, which are pseudo-random in $\mathrm{KExp}_{KHS,A,keyh}$ and random in $\mathrm{KExp}_{KHS,A,keyh,\sim}$, we can use a rather standard simulation argument to show that this difference can be upper bounded by $2\epsilon_f$. Furthermore, we obtain that $t' \le t_f + O(t_n(H) \cdot yz \log_z n)$, where $t_n(H)$ is the maximum running time of any function from $H_\lambda$ on inputs of length $n$.

The second upper bound. We now assume that in experiment $\mathrm{KExp}_{KHS,A,keyh,\sim}$ event $\mathrm{Coll}_\sim(A)$ happens.

In this case we can use $A$ to construct an algorithm $A'$ that violates the collision-intractability of the family $H$ of keyed hash functions. Specifically, algorithm $A'$, on input a keyed hash function $h_\lambda$ randomly chosen from $H_\lambda$, runs the following steps, where the computation of $h_\lambda(k_3; M)$ is done by querying $M$ to its $h_\lambda$ oracle:

1. Let $(x, x') = A(\alpha, v)$.
2. Randomly choose keys $k_1, k_2$ for pseudo-random function $f$.
3. Run algorithm CLKH on input $x$, using $h_\lambda$ as a keyed hash function, and let $S_{\ell,i,j}$ be the segment lists thus obtained, for $i = 1, \ldots, w-1$, $\ell = 0, \ldots, y-1$ and $j = 0, \ldots, z-1$.
4. Run algorithm CLKH on input $x'$, using $h_\lambda$ as a keyed hash function, and let $S'_{\ell,i,j}$ be the segment lists thus obtained, for $i = 1, \ldots, w-1$, $\ell = 0, \ldots, y-1$ and $j = 0, \ldots, z-1$.
5. If for some $i, \ell, j$, $S_{\ell,i,j} \ne S'_{\ell,i,j}$ and $h_\lambda(k_3; S_{\ell,i,j}) = h_\lambda(k_3; S'_{\ell,i,j})$ then return: $(S_{\ell,i,j}, S'_{\ell,i,j})$, else return: $\perp$.

We see that $A'$ obtains $A$'s output and then runs the steps of CLKH on input $x$ (resp., $x'$) to obtain segment lists $S_{\ell,i,j}$ (resp., $S'_{\ell,i,j}$). Thus, we obtain that $\mathrm{Prob}[\,\mathrm{Coll}_\sim(A)\,] \le \epsilon_h$ and $t' \le t_h + O(t_n(H) \cdot yz \log_z n)$, where $t_n(H)$ is the maximum running time of any function from $H_\lambda$ on inputs of length $n$.
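The final step of $A'$ can be sketched as a scan over paired segment lists. This is only an illustration under several assumptions: the segment lists are taken as already computed byte strings (CLKH itself is not reproduced), the oracle is modeled as a Python callable, and HMAC-SHA-256 is a stand-in for $h_\lambda(k_3; \cdot)$ rather than the keyed hash family used in the paper.

```python
import hashlib
import hmac


def hmac_oracle(key):
    """Stand-in oracle for h_lambda(k3; .); HMAC-SHA-256 is an assumption."""
    def oracle(message):
        return hmac.new(key, message, hashlib.sha256).digest()
    return oracle


def find_collision(seg_lists_x, seg_lists_x_prime, oracle):
    """Return the first pair (S, S') with S != S' but equal oracle outputs.

    The two inputs list the segment lists of x and x' in the same (l, i, j) order.
    Returns None (playing the role of the symbol "bottom") if no collision exists.
    """
    for s, s_prime in zip(seg_lists_x, seg_lists_x_prime):
        if s != s_prime and oracle(s) == oracle(s_prime):
            return (s, s_prime)
    return None
```

With a collision-intractable oracle the scan is expected to return `None`; replacing the oracle by a deliberately weak function, such as one that keeps only the first byte of its input, makes it return a colliding pair.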

Showing the adversary's failure. We now assume that in $\mathrm{KExp}_{KHS,A,keyh,\sim}$ event $\mathrm{Coll}_\sim(A)$ does not happen; that is: for all $\ell, i, j$, and any two segment lists $S_{\ell,i,j}, S'_{\ell,i,j}$ obtained when running CLKH (using random instead of pseudo-random strings) on input $x, x'$ returned by $A$, respectively, it holds that either $S_{\ell,i,j} = S'_{\ell,i,j}$ or $h_\lambda(k_3; S_{\ell,i,j}) \ne h_\lambda(k_3; S'_{\ell,i,j})$. We now show that $\mathrm{Prob}[\,\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v) = 1 \mid \overline{\mathrm{Coll}_\sim(A)}\,] = 0$ for
the value $\alpha$ claimed in the theorem, from which the theorem follows. As $\mathrm{tag}_{\ell,i,c,z}$ is masked by a one-time pad (each pad is used only once), we can assume that the adversary never queries the tag oracle. Towards our goal, we prove the following lemmas.

Lemma 3. Let $v \ge 2$ and $\rho = n(v+1)^{-1}$. Let $x$ and $x'$ be two $n$-block inputs such that $\mathrm{Diff}_v[x, x']$ is achieved by $S_1, \ldots, S_v$. If $|S_a| \le n(v+1)^{-i-1}$ for all $a = 1, \ldots, v$ and for some $i$, then there exist $\ell_i \in \{0, \ldots, v\}$ and $u_i^1, \ldots, u_i^v \in \{0, \ldots, (v+1)^i - 1\}$ such that $S_a \subseteq B_{\ell_i,i,u_i^a}$ for all $a = 1, \ldots, v$.

Proof. In the proof of Lemma 2, we showed that given $i$ and $a$, if $|S_a| \le n(v+1)^{-i-1}$, there exists at most one $\ell \in \{0, \ldots, v\}$ satisfying property (♠): $\exists j_1, j_2$ such that $S_a \cap B_{\ell,i,j_1} \ne \emptyset$ and $S_a \cap B_{\ell,i,j_2} \ne \emptyset$. Since there are at most $v$ segments $S_a$, at most $v$ of the $v+1$ values of $\ell$ satisfy (♠) for some $a$. Hence there is at least one $\ell_i \in \{0, \ldots, v\}$ for which every $S_a$ intersects a single block, that is, for each $a$ there exists $u_i^a$ such that $S_a \subseteq B_{\ell_i,i,u_i^a}$. ⋄

Based on Lemma 3, we can prove the following.

Lemma 4. Let $v \ge 2$. For $i \in \{0, 1, \ldots, w-1\}$, if $|S_a| \le n(v+1)^{-i-1}$ for all $a$, then $|T_i| \le vn(v+1)^{-i}$, except for a negligible probability.

Proof. Note that since the one-time pad $psr_2$ perfectly masks $h_\lambda(k_3; \cdot)$, before the adversary outputs $x'$, the randomness in the coloring as well as all $h_\lambda(k_3; S_{\ell,i,c,z})$ is independent of its view, where the independence is over the randomness of the coloring only. Our proof uses induction on $i$. We show that if $|S_a| \le n(v+1)^{-i-1}$ for all $a = 1, \ldots, v$, then with high probability (to be determined soon), $T_i \subseteq B_{\ell_i,i,u_i^1} \cup \cdots \cup B_{\ell_i,i,u_i^v}$, where $u_i^1, \ldots, u_i^v, \ell_i$ are defined as in Lemma 3. This immediately implies that $|T_i| \le vn(v+1)^{-i}$. For $i = 0$, $T_0 = \{0, \ldots, n-1\}$ and by default we define $B_{\ell,0,0} = L_{\ell \cdot (v+1)^{w-1}}(x)$, $\ell = 0, \ldots, v$. The conclusion holds trivially for case $i = 0$. Assume the conclusion holds for case $i-1$. We consider case $i$. In this case, since we assume $|S_a| \le n(v+1)^{-i-1} < n(v+1)^{-(i-1)-1}$, by induction for case $i-1$, we have

$$T_i \subseteq T_{i-1} \subseteq B_{\ell_{i-1},i-1,u_{i-1}^1} \cup \cdots \cup B_{\ell_{i-1},i-1,u_{i-1}^v}. \qquad (6)$$

By definition of $\ell_k, u_k^a$ for $k = i-1, i$ and $a = 1, \ldots, v$, we have that $S_a \subseteq B_{\ell_{i-1},i-1,u_{i-1}^a} \cap B_{\ell_i,i,u_i^a}$. Among all $B_{\ell_i,i,j}$ for $j = 0, \ldots, (v+1)^i - 1$, $B_{\ell_{i-1},i-1,u_{i-1}^a}$ intersects with at most $v+2$ of them. Let $D = \{B_{\ell_i,i,j} \mid B_{\ell_i,i,j} \cap (\cup_{a=1}^{v} B_{\ell_{i-1},i-1,u_{i-1}^a}) \ne \emptyset,\ j = 0, \ldots, (v+1)^i - 1\}$. We have $|D| \le v(v+2)$ and in particular $B_{\ell_i,i,u_i^a} \in D$ for $a = 1, \ldots, v$. Consider the event $\mathrm{Good}_i$: among all $\lambda$ colorings, there exists a coloring $z_i$ such that $B_{\ell_i,i,u_i^a}$ for $a = 1, \ldots, v$ are assigned to an identical color $c_i$ while the remaining $B_{\ell_i,i,j}$'s in $D$ are assigned to colors other than $c_i$. It is immediate that

$$\Pr[\mathrm{Good}_i] \ge 1 - \Big( 1 - (v+1)^{-v+1} \big( v/(v+1) \big)^{v^2+v} \Big)^{\lambda}, \qquad (7)$$

which is at least $1 - \exp\big(-\lambda (v+1)^{-(v+2)v+1} v^{v^2+v}\big)$. When $\mathrm{Good}_i$ occurs,

$$T_i \subseteq T_{i-1} \setminus \bigcup_{c \ne c_i} I_{\ell_i,i,c,z_i} \subseteq I_{\ell_i,i,c_i,z_i} \cap T_{i-1} \subseteq \bigcup_{a=1}^{v} B_{\ell_i,i,u_i^a}. \qquad (8)$$

Note that our induction assumption for case $i-1$ assumes $\mathrm{Good}_t$, $t = 1, \ldots, i-1$. It follows that the conclusion holds with probability

$$\Pr\Big[\bigcap_{t=1}^{i} \mathrm{Good}_t\Big] \ge 1 - \sum_{t} \Pr\big[\overline{\mathrm{Good}_t}\big] \ge 1 - i \exp\big(-\lambda (v+1)^{-(v+2)v+1} v^{v^2+v}\big) \ge 1 - \exp\big(-\lambda (v+1)^{-(v+2)v+1} v^{v^2+v}\big) \cdot \log_{v+1} n.$$

This completes the proof. ⋄
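The quantities in bound (7) and in the union bound can be evaluated numerically to see how the number of colorings $\lambda$ must scale with $v$. The sketch below is illustrative only; the function names and sample values are assumptions, and exact rational arithmetic is used to confirm that the per-coloring probability in (7) equals the expression in the exponent.

```python
import math
from fractions import Fraction


def per_coloring_success(v):
    """p = (v+1)^(-v+1) * (v/(v+1))^(v^2+v), the inner probability in (7)."""
    return Fraction(1, (v + 1) ** (v - 1)) * Fraction(v, v + 1) ** (v * v + v)


def exponent_form(v):
    """(v+1)^(-(v+2)v+1) * v^(v^2+v), as it appears inside exp(.)."""
    return Fraction(v ** (v * v + v), (v + 1) ** ((v + 2) * v - 1))


def failure_bound(v, lam, n):
    """Upper bound exp(-lam * p) * log_{v+1}(n) on the failure probability."""
    p = float(per_coloring_success(v))
    return math.exp(-lam * p) * math.log(n, v + 1)
```

For $v = 2$ the per-coloring probability is $64/2187$, so a few thousand colorings already make `failure_bound` very small for moderate $n$, while the required $\lambda$ grows quickly as $v$ increases.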

Let $i$ be the maximum value such that $|S_{a'}| \le n(v+1)^{-i-1}$ for some $a'$; by maximality, $(v+1)^{w-i} \le (v+1)^2 |S_{a'}|$. From Lemma 4, except for a negligible probability, we have

$$|T_{w-1}| \le |T_i| \le v(v+1)^{w-i} \le (v+1)^2 v \, |S_{a'}| \le (v+1)^2 v \sum_{a=1}^{v} |S_a|.$$

Thus, the localization factor $\alpha$ is equal to $(v+1)^2 v$, and the probability that experiment $\mathrm{KExp}_{KHS,A,keyh,\sim}(\alpha, v)$ returns 1 in this case is negligible. This concludes the proof of the localization property of KHS and thus the proof of Theorem 2.
