Abstract
Large amounts of data are being produced by Internet-of-things sensor networks and applications. Secure and efficient deduplication of Internet-of-things data in the cloud is vital to the prevalence of Internet-of-things applications. To ensure data security during deduplication, different data should be assigned different privacy levels. We propose a deduplication scheme based on threshold dynamic adjustment to ensure the security of data uploading and related operations. The concept of the ideal threshold is introduced for the first time; it can be used to eliminate the drawbacks of the fixed threshold in traditional schemes. Item response theory is adopted to determine the sensitivity of different data and their privacy scores, which ensures the applicability of the data privacy score and addresses the problem that some users care little about privacy. We propose a privacy score query and response mechanism based on data encryption. On this basis, a dynamic adjustment method for the popularity threshold is designed for data uploading. Experimental results and analysis show that the proposed scheme based on threshold dynamic adjustment has decent scalability and practicability.
Introduction
With the rapid development of Internet-of-things (IoT) sensor networks and applications, an increasing amount of data is generated and stored in cloud services. Cloud storage has not only evolved into a major storage scheme but also provides IoT applications with abundant, if not limitless, storage capacity. Duplicate data are almost unavoidable with thousands of IoT sensors working all day long. 1 Data sharing among users or devices has become a common requirement. These challenges are new to cloud storage providers (CSPs). As the amount of uploaded data increases, so does the extent of data redundancy. Statistics show that up to 60% of the data stored in cloud storage are redundant, 2 and a large amount of cloud storage resources is consumed as a result, which greatly increases the storage and maintenance costs of the CSP, especially for IoT-sensor network–based applications. 3
To solve the above problems, CSPs generally adopt deduplication technology, 4 which detects identical data objects in the upload stream. Deduplication-enabled systems store only a single copy of the data and create links for other users (or IoT devices) who upload the same data. Deduplication schemes can be classified into block-level 5 and file-level deduplication depending on the size of the objects. Compared with traditional data compression technology, deduplication eliminates not only data redundancy within files but also redundancy between files in shared data sets. 6,7 However, some users are unaware of data security, so a large amount of private data is shared without the user’s consent. In recent years, large-scale data leakage events have triggered great concern about privacy issues. 8 Therefore, how to protect user privacy while improving the efficiency of cloud deduplication for IoT applications has become a key issue. 9 Harnik et al. 10 first discussed the security problem of client-side deduplication. Since then, the subject has been extensively explored, and it is still under investigation both in its methodological aspects and in concrete applications. A deduplication scheme for encrypted upload data, known as convergent encryption, was first proposed in Bolosky et al. 11 In this scheme, the hash value of the data is used as the encryption key. However, the direct relation between the key and the data reduces the security of the scheme. In Xu et al., 12 the multi-client cross-deduplication scheme Xu-CDE was first applied to the ciphertext deduplication problem. 13 The scheme protects the security of private data in the scenario where external attackers coexist with honest-but-curious servers. However, in terms of applicability, this scheme suffers from low encryption efficiency and lacks a real-time authentication mechanism.
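The convergent encryption idea described above, in which the key is derived from the data itself, can be illustrated with a minimal Python sketch. The function names and the SHA-256-based keystream are assumptions made for illustration only (a toy cipher, not the construction of Bolosky et al., and not secure for real use). The sketch only shows that identical plaintexts yield identical keys and ciphertexts, which is the property that allows a CSP to deduplicate encrypted data.

```python
import hashlib
from typing import Tuple


def derive_key(data: bytes) -> bytes:
    """Convergent key: the SHA-256 digest of the plaintext itself."""
    return hashlib.sha256(data).digest()


def keystream(key: bytes, length: int) -> bytes:
    """Toy deterministic keystream (SHA-256 over key || counter). Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(data: bytes) -> Tuple[bytes, bytes]:
    """Return (key, ciphertext); both depend only on the plaintext."""
    key = derive_key(data)
    cipher = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
    return key, cipher


def convergent_decrypt(key: bytes, cipher: bytes) -> bytes:
    """XOR with the same keystream recovers the plaintext."""
    return bytes(a ^ b for a, b in zip(cipher, keystream(key, len(cipher))))
```

Because the key is a deterministic function of the plaintext, anyone who can guess the plaintext can confirm the guess, which is the weakness noted above.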
In view of the above shortcomings, MRN-CDE (MLE based and random number modified client-side deduplication of encrypted data in cloud storage) was proposed, 14 which applies random numbers to ensure the freshness of the authentication credentials. To reduce the amount of computation in the encryption and decryption processes and to ensure data security, the scheme extracts the key from the original data using the KP algorithm 15 in the message locked encryption (MLE) scheme. In addition, some CSPs provide users with client-side encryption options, allowing users or IoT sensor networks to encrypt the data before uploading them. This method can effectively protect data privacy. However, identical plaintext can be encrypted into different ciphertexts by different users, which makes it difficult for the CSP to perform deduplication. Therefore, although the above schemes improve the security of cloud storage, the storage efficiency is still unsatisfactory. 16
To improve the efficiency of deduplication, Stanek et al. proposed a scheme based on popularity partition. Data of different popularity are encrypted with different encryption methods, which can effectively improve the efficiency of deduplication. 17
The scheme assigns a fixed popularity threshold (
In current practical cloud storage applications, the CSP simply sets a fixed threshold for all uploaded data, which leads to many problems. If the threshold is set too high, all copies of low-privacy data must be stored until the threshold is reached. If the threshold is set too low, data with higher privacy will be deduplicated prematurely, which may increase the security risk. Therefore, different thresholds should be set according to the privacy level of the data, and the user’s understanding of the privacy level should be considered at the same time. For example, a frequently used software installation package should have a low threshold so that it can be deduplicated quickly, which minimizes the storage overhead without compromising users’ privacy. When the internal confidential files of a company are uploaded, the CSP could set a relatively large threshold according to the user’s understanding of the privacy level, thereby avoiding premature deduplication and better protecting the user data. However, how to recognize the privacy level of each piece of uploaded data and assign a reasonable threshold according to that level is still a difficult and open problem.
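The trade-off between a fixed threshold and per-data thresholds can be made concrete with a small Python sketch. The class name `ThresholdDedupStore`, its methods, and the example labels are hypothetical and are not part of the proposed scheme. The sketch assumes that popularity is the number of distinct uploaders and that a single shared copy is kept once popularity reaches the threshold.

```python
from collections import defaultdict


class ThresholdDedupStore:
    """Counts distinct uploaders per data label and deduplicates once the
    count reaches that label's popularity threshold (hypothetical API)."""

    def __init__(self, default_threshold):
        self.default_threshold = default_threshold
        self.thresholds = {}               # label -> per-data threshold
        self.uploaders = defaultdict(set)  # label -> distinct user ids

    def set_threshold(self, label, threshold):
        self.thresholds[label] = threshold

    def upload(self, label, user_id):
        """Record an upload and report whether the data is now deduplicated."""
        self.uploaders[label].add(user_id)
        return self.is_deduplicated(label)

    def is_deduplicated(self, label):
        threshold = self.thresholds.get(label, self.default_threshold)
        return len(self.uploaders[label]) >= threshold

    def stored_copies(self, label):
        """One shared copy after deduplication, otherwise one copy per uploader."""
        count = len(self.uploaders[label])
        return 1 if count and self.is_deduplicated(label) else count


store = ThresholdDedupStore(default_threshold=5)
store.set_threshold("installer.pkg", 2)      # low privacy: deduplicate early
store.set_threshold("confidential.doc", 50)  # high privacy: deduplicate late
for user in ("u1", "u2", "u3"):
    store.upload("installer.pkg", user)
    store.upload("confidential.doc", user)
```

After three uploads, the installer is stored once, while the confidential file is still stored as three separate copies.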
In summary, our work makes the following contributions:
We propose a deduplication scheme suitable for IoT sensor networks based on threshold dynamic adjustment to ensure the security of upload data and related operations.
We design the threshold dynamic adjustment mechanism, using the item response theory (IRT) to dynamically adjust the threshold. We use the query feedback mechanism to collect the privacy attitude of most users to determine a reasonable threshold for each data uploaded.
The concept of ideal threshold is proposed for the first time, which eliminates the disadvantages of unified threshold in traditional schemes.
The remainder of this article is organized as follows. Section “System model and design goals” introduces the system model and design goals. Section “Preliminary knowledge” gives the preliminary knowledge of the scheme and its related formulas. Section “Deduplication scheme” elaborates on the design of our scheme in three parts: privacy score query, data uploading, and privacy score calculation. Section “Security analysis” details the security analysis. Section “Performance evaluation and experiments” gives the experimental comparison and analysis. Section “Conclusion” summarizes the work and draws conclusions.
System model and design goals
System model
The system design is based on the IoT sensor network. The system model involves only two types of entities: the upload user and the CSP. The upload user is an abstraction of an IoT device, but it can also be an actual user in the IoT sensor network. Once the system is established, the upload user can interact with the CSP. During the interaction, the upload user can play two roles, data uploader or data observer, and the CSP only provides data storage and data sharing services for the upload user, without knowing the exact content of the data. The system model is shown in Figure 1.

Figure 1. System model.
This model introduces the concept of privacy scores (
Design goals
In order to better protect data privacy, the scheme should have the following characteristics:
Confidentiality of uploaded data: uploaded data require a certain encryption operation.
Queryability of privacy scores: when users upload data, they can query and get a reasonable privacy score from the CSP as a reference.
Updateability of privacy scores and thresholds: the privacy scores and thresholds of data can be updated in real time according to the specific upload situation.
Preliminary knowledge
IRT and its characteristics
IRT 23 is a well-known psychological theory that is often used to analyze questionnaire results and test data. It infers the probability that a tested user will correctly answer a given question from the ability of the tested user and the difficulty of the specific test item. Moreover, it has been shown that IRT can be applied in cloud computing scenarios. 24
The Rasch model 25 is one of the most common IRT models. It assumes that the probability function of correct response is only related to
Therefore, IRT has two notable features:
Group invariance: the difficulty of an item is a natural attribute of the item, which is independent of the tested user’s response. In other words, the parameters of an individual item apply not only to the users currently being tested but also to other types of users. 26
User independence: one tested user will not affect the answer of another tested user to a question, and the answer only depends on the question itself.
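As a concrete reference for the Rasch model mentioned above, the following Python sketch evaluates its standard one-parameter logistic form, P = 1 / (1 + exp(-(ability - difficulty))). The parameter names `ability` and `difficulty` are assumptions, since the article's own notation for equation (1) is not reproduced here.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct (positive) response under the Rasch model.

    Depends only on the difference between the user's ability and the
    item's difficulty.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))
```

When ability equals difficulty, the probability is exactly 0.5. It rises toward 1 as ability exceeds difficulty and falls toward 0 in the opposite case.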
General sensitivity calculation method
Generally, the more sensitive the data are, the less likely the user is to disclose them. As shown in equation (2),
where
General visibility calculation method
In the case where the answer to the question is a binary value, we usually estimate the probability to calculate the visibility of the data. 27 Assume that the test items and the tested users are independent of each other, that is, in a test survey, the chance of the tested user answering each question is the same. We can calculate the value of
where
Data privacy score
The privacy score (
where the operator ⊗ represents any monotonically increasing combination function of sensitivity and visibility. The details of the calculation process are described in equations (2) and (3).
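To illustrate how sensitivity and visibility might be combined, the Python sketch below uses one simple instantiation from the privacy-score literature. The formulas are assumptions, not a reproduction of equations (2) to (4): sensitivity is the fraction of users who do not disclose an item, visibility is the product of the item's and the user's disclosure rates under the independence assumption, and the plain product stands in for ⊗. The function names and the 0/1 response matrix layout (rows are items, columns are users) are hypothetical.

```python
def naive_sensitivity(disclosures):
    """Fraction of users who did NOT disclose the item (1 = disclosed, 0 = hidden)."""
    n = len(disclosures)
    return (n - sum(disclosures)) / n


def naive_visibility(matrix, item, user):
    """Estimated probability that `user` discloses `item`, assuming items and
    users are independent: (item disclosure rate) * (user disclosure rate)."""
    n_items, n_users = len(matrix), len(matrix[0])
    item_rate = sum(matrix[item]) / n_users
    user_rate = sum(row[user] for row in matrix) / n_items
    return item_rate * user_rate


def privacy_score(matrix, user):
    """Sum over items of sensitivity * visibility; the product stands in for ⊗."""
    return sum(
        naive_sensitivity(matrix[i]) * naive_visibility(matrix, i, user)
        for i in range(len(matrix))
    )
```

Any other monotonically increasing combination could replace the product without changing the structure of the calculation.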
Bilinear mapping
Let
Bilinear:
Computability:
Non-degenerate:
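For reference, the three properties listed above are usually stated as follows in the textbook definition of a bilinear map. The symbols $G_1$, $G_2$, $p$, $g$, and $e$ are assumed notation and may differ from the notation used elsewhere in this article.

```latex
Let $G_1$ and $G_2$ be cyclic groups of prime order $p$, let $g$ be a
generator of $G_1$, and let $e : G_1 \times G_1 \rightarrow G_2$ be a map.
\begin{itemize}
  \item Bilinear: $e(u^{a}, v^{b}) = e(u, v)^{ab}$ for all $u, v \in G_1$
        and all $a, b \in \mathbb{Z}_p$.
  \item Computability: $e(u, v)$ can be computed efficiently for all
        $u, v \in G_1$.
  \item Non-degenerate: $e(g, g) \neq 1_{G_2}$.
\end{itemize}
```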
Deduplication scheme
Overview
When a user uploads encrypted data to the CSP, the CSP can check whether the data have already been stored using the query label generated by elliptic curve technique and return a suggested

Figure 2. The overall design of the scheme.
Privacy score query
When a user uploads data to the CSP, the user can query the current
In Harkous et al., 24 a scheme for context-based data privacy queries is introduced in detail. Before querying the degree of privacy, the user forms a query set of multiple virtual contexts and sends them to the CSP in order to hide the actual request context. The context-based privacy score query only needs to replace the data privacy with the corresponding privacy score based on the above scheme. When the user queries the CSP for the
Context-based privacy score query is easy to implement, but we usually assume that the CSP is honest but curious, so it can obtain specific information about the data uploaded by users through offline analysis and other operations. Therefore, we adopt a privacy score query mechanism for encrypted data. Based on the existing research results of our team, this mechanism uses the elliptic curve-based file label query scheme 20 to facilitate the privacy score query. In Zhang et al., 20 a popularity query protocol without online trusted third parties is proposed. By constructing a bilinear map query label
Popularity threshold
In order to improve the efficiency of deduplication, the CSP assigns a popularity threshold
Data uploading procedure
When the

Figure 3. The data uploading procedure in our scheme.
When
Privacy score calculation
In this section, we design the privacy score calculation method based on IRT, in order to ensure that the data’s
Because different upload data
For each parameter
In other words, we need to search for the data parameters
Similarly, on the basis that the sensitivity
Finally, we integrate it through formula (6) and calculate the
At the same time, current
As the number of upload users increases, the threshold changes and gradually approaches
Because different upload data are independent of each other, the CSP can compute the
The parameters used in the privacy score calculation based on IRT are estimated by likelihood function which satisfies the property requirement of group invariance. This makes the
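The likelihood-based parameter estimation referred to in this section can be sketched in Python for the simplest case. This sketch assumes the one-parameter Rasch form, assumes that the user parameters are already known, and estimates a single item parameter by Newton-Raphson. The function name `estimate_difficulty` and the simulated data are hypothetical, and the estimator actually used by the scheme may differ.

```python
import math
import random


def estimate_difficulty(abilities, responses, iterations=50, tol=1e-10):
    """Newton-Raphson maximum-likelihood estimate of one item's difficulty,
    given known user abilities and binary responses (1 = positive)."""
    b = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for theta in abilities]
        gradient = sum(p - x for p, x in zip(probs, responses))  # dL/db
        curvature = -sum(p * (1.0 - p) for p in probs)           # d2L/db2
        step = gradient / curvature
        b -= step                                                # Newton update
        if abs(step) < tol:
            break
    return b


# Simulated check: responses are drawn from a Rasch model with a known difficulty.
rng = random.Random(0)
true_b = 0.5
abilities = [rng.gauss(0.0, 1.0) for _ in range(5000)]
responses = [
    1 if rng.random() < 1.0 / (1.0 + math.exp(-(t - true_b))) else 0
    for t in abilities
]
b_hat = estimate_difficulty(abilities, responses)
```

With enough responses, the estimate settles near the value used to generate the data, and the estimate depends on the responses only through the likelihood.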
Security analysis
The scheme is designed to better protect the security of private data through dynamically adjustable thresholds. Our proposed deduplication scheme makes it impossible for the CSP to be spoofed. The data can be obtained only by obtaining the query label of the data. Here, we mainly discuss the authenticity and differentiability of the query labels. The security theorems are as follows.
Lemma 1
For a safe hash function
Theorem 1
Proof
If
Assuming that adversary
Assuming that adversary
In both cases, adversary
That is,
Theorem 2
Proof
Suppose there is
If the above formula is valid,
Lemma 2
Computational Diffie–Hellman (CDH) problem. Suppose
Theorem 3
Proof
Let CSP execute an offline brute force attack on the query label
In addition, we considered the situation of malicious scoring. We ran simulations, and the detailed results can be found in the next section.
Performance evaluation and experiments
The experiment uses the PBC, 30 GMP, 31 PBC_bce, 32 and OpenSSL 33 function libraries and is implemented in C++. It is deployed on a Tencent cloud storage server equipped with 4 GB of memory, a 4-core CPU, 1 Mbps of bandwidth, and 1 TB of storage. In order to make it easier for users to understand and to operate, our scheme adopts a more user-friendly design in the implementation of
Data set
In view of the problem that different data have different ideal thresholds, we carry out a comparative experiment on the overall
In the performance comparison experiment, we choose 1000 files of 10 MB as the upload data, in which the ratio of data with lower privacy to data with higher privacy is about 3:2. Other schemes adopt the unified popularity threshold and set it to
Experimental analysis
The data of the above three groups of experimental data sets are simulated by uploading and dynamic adjusting threshold, respectively, and the changes of the whole privacy score and
Figures 4–6 are derived from the data set with the interval of (80–100), where Figure 4 shows the change of the

Figure 4. Privacy score with the number of upload users (80–100).
Figure 5. Relationship between threshold and privacy score (80–100).
Figure 6. Actual deduplication threshold (80–100).
Figure 7. Privacy score with the number of upload users (0–20).
Figure 8. Relationship between threshold and privacy score (0–20).
Figure 9. Actual deduplication threshold (0–20).
Figure 10. Privacy score with the number of upload users (1–100).
Figure 11. Relationship between threshold and privacy score (1–100).
Figure 12. Actual deduplication threshold (1–100).
In Figure 4, all users are uploading data with low degree of privacy, but there are still small differences in the specific case in the value of numerical
In Figures 4, 7, and 10, when the number of samples is large enough,
Anti-abuse experiment
The anti-abuse experiment mainly tests the anti-abuse capability of the scheme from the aspect of malicious scoring. We assume that a malicious user deliberately sets the privacy score to 100 when the user privacy score is generally low (0–20), or sets the privacy score to 0 when the user privacy score is generally high (80–100). We tested the impact of malicious user scoring in three different settings, with 2%, 3%, and 4% malicious users, respectively. The experimental results are shown in Figures 13 and 14. Figure 13 reflects the situation where malicious users intentionally choose a larger privacy score when uploading data with a lower privacy score. Figure 14 reflects the situation where malicious users intentionally choose a smaller privacy score when uploading data with a higher privacy score. To facilitate observation, we assume that malicious users send their malicious ratings at the beginning. As can be seen from Figures 13 and 14, the higher the proportion of malicious users, the greater the impact on the privacy score, and the larger the population of upload users, the smaller the effect of the malicious scores. When the number of upload users exceeds 70, the four curves in each figure differ little, which indicates that the scheme has anti-abuse capability.

Figure 13. The impact of malicious ratings on privacy score (0–20).
Figure 14. The impact of malicious ratings on privacy score (80–100).
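The dilution effect observed in the anti-abuse experiment can be reproduced qualitatively with a short Python simulation. As a simplifying assumption, a running mean of the reported scores stands in for the IRT-based privacy score of the scheme. The function name `running_scores`, its parameters, and the uniform distribution of honest scores are hypothetical.

```python
import random


def running_scores(n_users, malicious_fraction, honest_range=(0.0, 20.0),
                   malicious_score=100.0, seed=0):
    """Running mean of reported scores when all malicious reports arrive first."""
    rng = random.Random(seed)
    n_malicious = round(n_users * malicious_fraction)
    honest = [rng.uniform(*honest_range) for _ in range(n_users - n_malicious)]
    reports = [malicious_score] * n_malicious + honest
    means, total = [], 0.0
    for count, score in enumerate(reports, start=1):
        total += score
        means.append(total / count)  # aggregate score after `count` uploads
    return means


baseline = running_scores(100, 0.0)
with_2_percent = running_scores(100, 0.02)
with_4_percent = running_scores(100, 0.04)
```

In this simplified setting, a larger malicious fraction shifts the aggregate score further, and the shift shrinks as more honest users upload, which matches the trend described for Figures 13 and 14.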
Performance comparison
By uploading 1000 files of 10 MB, the total time consumption of our scheme is measured and compared with that of other schemes, namely the PerfectDedup scheme, the common popularity threshold-based deduplication scheme, and the Xu-CDE scheme. The experiment is repeated 10 times, and the average is taken as the final result, which is shown in Figure 15. In the data encryption phase, the time consumption of the four schemes is similar. In the query stage, our scheme has advantages over other schemes that recognize data popularity. Finally, compared with other schemes, our design does not cause additional time overhead while improving the security of the deduplication operation.

Figure 15. Performance comparison of our scheme and other schemes.
Conclusion
In this article, we address the issue of the deduplication threshold in the cloud storage scenario and propose a secure deduplication scheme for IoT sensor networks based on threshold dynamic adjustment. The concept of the ideal threshold is proposed for the first time, and IRT is applied. Based on upload users’ feedback on the data privacy level, the privacy score can be dynamically adjusted, and the deduplication threshold is calculated and adjusted accordingly. This scheme can speed up the deduplication stage for data with lower privacy, while data with higher privacy can be better protected. Experiments show that our scheme not only improves the security of the deduplication operation but also avoids additional time overhead. Compared with other schemes, our scheme is more practical for real-world applications.
How to improve the efficiency of data deduplication for IoT applications while ensuring data security will be studied in future work.
Footnotes
Handling Editor: Vishal Sharma
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (61702294), the Shandong Provincial Natural Science Foundation (ZR2019MF058), and the Open Project Program of The State Key Laboratory of Integrated Services Networks (ISN19-14).
