Abstract
In order to rapidly process large amounts of sensor stream data, it is effective to extract and use samples that reflect the characteristics and patterns of the data stream well. In this article, we focus on improving the uniformity confidence of KSample, which has the characteristics of random sampling in the stream environment. For this, we first analyze the uniformity confidence of KSample and derive two uniformity confidence degradation problems: (1) initial degradation, in which the uniformity confidence drops rapidly in the early stage, and (2) continuous degradation, in which it gradually decreases in the later stages. We note that the initial degradation is caused by the sample range limitation and the past sample invariance, and the continuous degradation by the sampling range increase. For each cause, we present a corresponding solution: the sample range extension for the sample range limitation, the past sample change for the past sample invariance, and the use of the UC-window for the sampling range increase. Reflecting these solutions, we then propose a novel sampling method, named UC-KSample, which largely improves the uniformity confidence. Experimental results show that UC-KSample improves the uniformity confidence over KSample by 2.2 times on average and always keeps the uniformity confidence higher than the user-specified threshold. The sampling accuracy of UC-KSample is also higher than that of KSample on both numeric sensor data and text data. Uniformity confidence is an important sampling metric in sensor data streams, and this is the first attempt to apply it to KSample. We believe that the proposed UC-KSample is an excellent approach that retains a key advantage of KSample, dynamic sampling at a fixed sampling ratio, while improving the uniformity confidence.
Introduction
A data stream refers to a continuous form of data that is constantly generated.1,2 In particular, when a stream has a large capacity, such as sensor data,3 its real-time processing is very costly.4 Thus, it is effective to extract and use samples that reflect the characteristics and patterns of the sensor data stream well.5 In this article, we address an effective sampling method in the stream environment of large sensor data.6,7
Specifically, we address the problem of improving the uniformity confidence8,9 in KSample. KSample is a random sampling method that dynamically increases the sample size to maintain a fixed sampling ratio for the data stream,10 and the uniformity confidence is an important measure of how uniformly a sampling algorithm produces samples. In stream sampling, it is impossible to consider all the sensor data continuously due to memory limitations. Thus, to measure the reliability of the samples, it is important to determine how uniformly a given algorithm generates them; that is, uniformity confidence is an important indicator of the performance of the stream sampling algorithm itself. However, in the case of KSample, which extracts one sample element from a fixed-size stream unit, the range of elements that can be extracted into a memory slot is very limited, and KSample may also incur low uniformity confidence because previously selected samples never change. Also, as the sensor data flows continuously, "the number of sample cases generated by KSample" grows more slowly than "the number of statistically possible sample cases," so the relative ratio of the two cases decreases gradually, and the uniformity confidence continuously decreases. As described above, uniformity confidence is a very important performance measure in the stream environment, and in this article we propose uniformity confidence-based KSample (UC-KSample in short) to increase the uniformity confidence of KSample in the stream environment of large sensor data.
In order to improve the uniformity confidence of KSample, we first analyze the reason why the uniformity confidence decreases in KSample. According to the analysis, the low uniformity confidence happens for two reasons: (1) a limited range of data streams that can be sampled and (2) no change of the previously sampled data. In this article, we refer to this problem as initial uniformity confidence degradation or simply initial degradation. In addition, since we cannot use all the data as a population, the uniformity confidence continuously decreases as the sample size increases. In contrast to the initial degradation, we refer to this problem as continuous uniformity confidence degradation or simply continuous degradation. We next explain each problem in detail and present its efficient solution.
First, the initial degradation occurs because of two properties: sample range limitation and past sample invariance. The sample range limitation arises because the range of data streams that can be selected for a slot is limited. The past sample invariance arises because the elements already stored in a slot are sent to the secondary storage and cannot be changed. To solve these problems, UC-KSample first alleviates the sample range limitation property by including the already sampled data in the next sample extraction range, which increases the number of samples that can be generated. It also alleviates the past sample invariance property by allowing even an already selected sample to be changed, which again increases the number of samples that can be extracted. Since the uniformity confidence is the ratio of the "number of sample cases that can be generated by a specific algorithm" to the "number of all possible sample cases that can be statistically generated," we can improve it by increasing the number of samples that can be generated.
Next, the continuous degradation occurs because of the sampling range increase in which the range of streams to be sampled becomes greater than the range of streams that can actually be considered. To solve this problem, we present the concept of UC-window in UC-KSample and explain how to determine the size of UC-window to guarantee a certain uniformity confidence. That is, we determine and manage the window to divide the data stream in consideration of the uniformity confidence. More precisely, when we perform sampling, we ensure that the sampling algorithm has the uniformity confidence always higher than the user-specified threshold by starting a new window before the uniformity confidence becomes lower than the threshold.
Experimental results show that, compared to the existing KSample, the proposed UC-KSample significantly improves the uniformity confidence and provides high sampling accuracy. Specifically, when we fix the user-given threshold to
The contributions of this article can be summarized as follows. First, we propose the UC-KSample algorithm, which combines the uniformity confidence, an important measure in the stream environment of continuous sensor data, with KSample. Second, through a thorough analysis of uniformity confidence in KSample, we derive its uniformity confidence degradation problems and present corresponding solutions to alleviate them. Third, we present the concept of the UC-window, which keeps the uniformity confidence always higher than the user-specified threshold, together with a formal method of calculating its size. Fourth, through extensive experiments, we show that the proposed UC-KSample maintains high accuracy while providing higher uniformity confidence than the existing KSample. Uniformity confidence is an important indicator of the reliability of a sampling algorithm for the stream environment, where we handle a huge volume of continuous streaming data. Thus, improving the uniformity confidence is a very important research topic in evaluating stream sampling algorithms. To the best of our knowledge, this is the first attempt to apply the uniformity confidence to KSample, and UC-KSample is an excellent approach that increases the uniformity confidence while maintaining the advantage of KSample, which achieves high similarity to the population by sampling dynamically at a fixed sampling ratio.
The rest of the article is organized as follows. Section “Related work” describes the related work on sampling algorithms used in stream environments. Section “Background” provides an overview of uniformity confidence and KSample. Section “Uniformity confidence analysis of KSample” analyzes the uniformity confidence of KSample. Section “UC-KSample” explains in detail the concept and operation procedure of UC-KSample proposed in this article. Section “Experimental evaluation” presents the results of experimental evaluation, and finally section “Conclusion” concludes the article.
Related work
Sampling techniques used in stream environments are classified into (1) fixed sampling size and (2) fixed sampling ratio methods according to how the sample amount is specified.11 The fixed sampling size method fixes the sample size regardless of the stream amount, and the fixed sampling ratio method fixes the sampling rate by changing the sample size according to the stream amount. The typical algorithms for the fixed sampling size are Reservoir,12 priority,13 and stratified8 sampling techniques; those for the fixed sampling ratio are KSample,10 Naive UC-KSample,14 systematic,15 and hash16 sampling techniques.
Reservoir sampling12 is a sampling method that applies random sampling to the stream environment. Even when the data stream flows in infinitely, it keeps a sample of a fixed size k, and every element of a stream of length n is included in the sample with the same probability k/n. It is commonly used in many areas because it is simple to implement and preserves the properties of random sampling.
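As a concrete illustration, the classic "Algorithm R" form of reservoir sampling can be sketched as follows; the function name and parameters here are our own illustrative choices, not taken from the cited work.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k elements from a stream of
    unknown length (classic Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # uniform slot in [0, i]
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample
```

The invariant is that after processing i elements, each of them is in the reservoir with probability k/i, which yields the uniform inclusion probability k/n at the end of the stream.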
Priority sampling13 assigns a weight to each element of the data stream and samples an element having a high weight. That is, a data element having high occurrence has high probability to be sampled, and a data element having low occurrence has low probability to be sampled. In general, priority sampling is known to provide high reliability of the sample by showing a similar probability distribution between the sample and the population.
Stratified sampling8 divides an input population into non-overlapping layers and extracts samples from each layer. Since each layer has different properties, this sampling method can reflect the characteristics of each layer as well as those of the entire population. Also, stratified sampling shows higher accuracy than random sampling in many cases because it forms the entire sample by extracting sub-samples from each layer.
KSample10 dynamically increases the sample size in the stream environment to keep the sampling rate constant for the input stream. It receives the sampling rate
Naive UC-KSample14 is a recent algorithm that increases the uniformity confidence of KSample. In other words, it is a sampling method that improves the uniformity confidence of KSample by alleviating the sample range limitation and the past sample invariance while keeping the sampling rate constant for the input stream. However, it has a critical problem in that the uniformity confidence continuously decreases over time. The UC-KSample proposed in this article differs substantially from Naive UC-KSample in that it solves this continuous degradation problem by introducing a uniformity confidence threshold and using the UC-window to satisfy that threshold.
Systematic sampling15 selects the first element randomly and then samples every k-th element repeatedly. The sampling interval
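The every-k-th selection rule described above can be sketched in a few lines; the function name and the list-based return are our own illustrative choices.

```python
import random

def systematic_sample(stream, k):
    """Select a random starting offset in [0, k), then take every
    k-th element of the stream from that offset."""
    start = random.randrange(k)
    return [item for i, item in enumerate(stream) if i % k == start]
```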
Finally, hash sampling16 uses a hash function to select samples. It first applies a hash function to each data element, then puts the element into the bucket corresponding to its hash value, and finally selects a specific bucket as the sample. Hash sampling shows good performance on complex data such as IP packets and log data. However, it is difficult to determine a good hash function for the given stream data.
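The bucket-selection idea can be illustrated as below; the use of MD5 and the parameter names are our own assumptions, since the cited work does not prescribe a particular hash function.

```python
import hashlib

def hash_sample(stream, num_buckets, chosen_bucket):
    """Hash each element into one of num_buckets buckets and keep
    only the elements that land in chosen_bucket."""
    selected = []
    for item in stream:
        # hash the element's string form; any stable hash would do
        digest = hashlib.md5(str(item).encode()).hexdigest()
        if int(digest, 16) % num_buckets == chosen_bucket:
            selected.append(item)
    return selected
```

Because the hash is deterministic, the same stream always yields the same sample, which is what makes the choice of hash function so important for this technique.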
Assuming that sensor data is generated infinitely in the stream environment, the fixed ratio approach, in which the sample amount varies with the input stream amount, is more effective than the fixed size approach. In this article, we focus on KSample among the fixed sampling ratio methods, since it has the characteristics of random sampling for sensor stream data.
Background
Uniformity confidence
Uniformity confidence is an important measure of a sampling algorithm that represents how many sample cases are considered. It is based on the definition that a sample is a uniform random sample if it is produced by a sampling algorithm in which all statistically possible samples of the same size are equally likely to be selected.9 The uniformity confidence is calculated as the ratio of the "number of sample cases that can be generated by a specific algorithm" to the "number of all possible sample cases that can be statistically generated," as in equation (1)
For example, as shown in Figure 1(a), when we want to choose three out of 10 data items randomly, the uniformity confidence is

(a) Sampling and (b) its scheme when the uniformity confidence = 100%.

(a) Sampling and (b) its scheme when the uniformity confidence <100%.
In the batch environment, where the population is fixed, we can maintain the uniformity confidence at 100% since we can perform sampling considering all data items. However, in a stream environment in which sensor data is generated in real time, the uniformity confidence is less than 100% because, due to real-time processing and limited memory, we can consider only the data temporarily stored in memory rather than all the data. Al-Kateb et al.8,9 proposed a sampling method to keep the uniformity confidence above a lower limit when the sample size changes in the stream environment. They also proposed an efficient way of maintaining the uniformity confidence above a lower limit for join queries in the stream environment.17 Ting18 proposed a sampling technique that satisfies constraint conditions such as memory and sample amount and uses the uniformity confidence as a measure for evaluating the sampling algorithm. The uniformity confidence measures how much data the sampling algorithm considers in the limited memory space.8,9 This means that the uniformity confidence shows the reliability of samples when not all data can be taken into consideration, especially in a continuous stream environment. Thus, the uniformity confidence itself is an indicator of the performance of the stream sampling algorithm, and how much data the algorithm considers, and thereby how high it keeps the uniformity confidence, is an important performance factor and research issue.
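Based on the definition in equation (1), the uniformity confidence is a simple ratio of counts; a minimal sketch with illustrative names follows, using the 3-out-of-10 example from Figure 1.

```python
from math import comb

def uniformity_confidence(algorithm_cases, n, k):
    """Uniformity confidence (percent): the number of sample cases a
    specific algorithm can generate, divided by the number of all
    statistically possible samples of size k from n items, C(n, k)."""
    return algorithm_cases / comb(n, k) * 100

# Choosing 3 out of 10 items: C(10, 3) = 120 possible samples.
# A scheme that can realize all 120 of them has 100% uniformity
# confidence; a scheme restricted to only 60 of them has 50%.
```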
KSample
KSample is a random sampling method that dynamically increases the sample size in the stream environment so as to keep the sampling ratio for the input stream constant.10 KSample gets the sampling rate
Figure 3 shows the operation procedure of KSample, and its detailed explanation is as follows:
Stream input. The data source inputs a data stream to be sampled.
Sample size and sampling ratio verification. When a data stream is input, KSample checks whether the current sample size is
Slot creation. It creates one slot that is a single element memory space added to the sample.
Sampling. KSample performs sampling on the current slot. Specifically, if the probability that a data stream element is selected in the slot is greater than or equal to a random value, it inserts the corresponding element into the slot.
Secondary storage. It moves the sample element of the previous slot, that is, the slot created before the current one, to the secondary storage.

Operation procedure of KSample.
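The steps above can be sketched roughly as follows. This is only an illustrative reading of the procedure, not the paper's exact pseudocode (Figure 4); in particular, the within-slot selection rule used here (a reservoir of size one per slot) is an assumption, and all names are ours.

```python
import random

def ksample(stream, ratio):
    """Illustrative KSample loop: whenever the sample would fall
    below ratio * (stream length so far), archive the finished slot
    to secondary storage and open a new one-element slot."""
    archived = []     # elements moved to secondary storage
    sample = []       # the current slot (at most one element)
    slot_count = 0    # stream elements seen by the current slot
    for n, item in enumerate(stream, start=1):
        if len(archived) + len(sample) < ratio * n:
            archived.extend(sample)         # previous slot is now fixed
            sample, slot_count = [item], 1  # new slot starts with item
        elif sample:
            slot_count += 1                 # assumed rule: reservoir of
            if random.random() < 1.0 / slot_count:  # size 1 per slot
                sample[0] = item
    return archived + sample
```

Note how the archived elements never change once a new slot opens; this is exactly the past sample invariance discussed later.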
Figure 4 shows the KSample algorithm. KSample receives the sampling ratio

KSample algorithm.10
Example 1
Figure 5 shows the operation procedure of KSample when the sampling ratio is

A sampling example of the operation procedure in KSample when
KSample has two advantages: following the fixed sampling ratio principle, it does not need to predefine the sample size, and it uses memory efficiently because previously sampled elements never change. However, we note here that the memory efficiency, an advantage of KSample, causes a large degradation of uniformity confidence. That is, since the data sampled in the previous slot has already moved to the secondary storage and is excluded from the sampling range of the next slot, the number of sample cases decreases, and the uniformity confidence decreases with it. Therefore, in this article, we propose UC-KSample, which improves the uniformity confidence while preserving KSample's advantages of memory efficiency and a fixed sampling ratio.
Uniformity confidence analysis of KSample
For sampling in the real-time stream environment, we can consider only the data stored in the current memory rather than all the data due to the memory limitation problem. Since we cannot include all the data in the sampling range, we need an objective measure for evaluating the sampling performance. The uniformity confidence used in this article is one of the major criteria for evaluating sampling performance, and improving it is an important research issue.
As described earlier, KSample is a useful sampling algorithm for stream environments because it increases the sample size dynamically so as to maintain a fixed sampling ratio. However, KSample suffers from a memory loss problem: it cannot consider the entire data stream when sampling. This memory loss causes a serious problem in which the uniformity confidence is very low and continuously decreases.
Figure 6 shows the uniformity confidence degradation problem of KSample in detail. The figure illustrates a graph comparing the uniformity confidence of Reservoir sampling with that of KSample

Comparison of uniformity confidence between KSample and Reservoir sampling.
In this article, we first analyze the initial and continuous degradation problems of uniformity confidence in KSample. We next derive the three causes of these degradations: sample range limitation, past sample invariance, and sampling range increase. We also present the sample range extension, the past sample change, and the use of the UC-window as the requirements for resolving those causes, respectively. We then propose UC-KSample, which improves the uniformity confidence of KSample by reflecting these three requirements, in the next section.
UC-KSample
In this section, we analyze the causes of uniformity confidence degradation in KSample and propose requirements to solve those problems. We then propose UC-KSample, which maintains high uniformity confidence of KSample, and explain its working procedure in detail using examples.
Analysis of problem causes
As mentioned earlier, KSample has two problems causing the uniformity confidence degradation. First, the initial degradation of KSample has two causes. The first cause is that the range of data streams that can be selected for a particular slot is limited. KSample dynamically increments the sample size by one if the sample size is less than
Example 2
Figure 7 shows an operation procedure of KSample when the sampling ratio

(a) Operation procedure of KSample for
As shown in Example 2, the uniformity confidence, which was 100% for the sample size of 1, sharply decreases for the sample size of 2 or more at the beginning of sampling. This initial degradation is due to the sample range limitation.
The second cause of the initial degradation is that the already sampled data moves to the secondary storage and does not change. Stream elements included in the sampling range of a slot compete to be selected as the sample in that slot. However, when the range of the data stream for the current slot is over and a new slot is created, the already selected data moves to the secondary storage and does not change, as shown in Figure 7(a). In this way, KSample cannot change the previously sampled data, which reduces the number of sample cases to be considered. In particular, the number of sample cases in KSample drops rapidly at the beginning of sampling, and so does the uniformity confidence. In this article, we describe the two causes of the initial degradation in detail as follows:
Sample range limitation. The range of stream elements for a specific sample slot is limited, and thus, the uniformity confidence decreases.
Past sample invariance. After completing the sampling for previous slots, the already selected samples do not change, and thus, the uniformity confidence decreases.
Second, we discuss the continuous degradation problem of KSample. The continuous degradation occurs because the number of sample cases in KSample increases more slowly than the number of statistically possible sample cases. More specifically, as the input stream grows, the population grows as well, but due to memory loss and the sample range limitation, KSample cannot use the entire input stream as the sample extraction range. Thus, the ratio of the number of sample cases generated by KSample to the number of statistically possible sample cases continuously decreases over time.
Example 3
Figure 8 shows the operation procedure of KSample and the result of uniformity confidence calculation at the sampling ratio

Operation procedure and uniformity confidence calculation at
In this article, we describe the cause of continuous uniformity confidence degradation as follows:
Sampling range increase. Since the increase in the number of sample cases in KSample is smaller than that of statistically possible sample cases, the relative ratio of these two cases gradually decreases.
The analysis results so far are summarized as follows. First, the initial degradation of KSample is incurred by the sample range limitation and the past sample invariance. Second, the continuous degradation is incurred by the sampling range increase. In the following subsection, we define requirements to solve these two degradation problems.
Requirements for UC-KSample
First, we refer to the requirement for the sample range limitation, which is a cause of the initial degradation, as the sample range extension. The sample range extension is to extend the range of stream data that can be extracted into a particular sample slot to the already sampled elements. As the population range considered in the sampling increases, the number of sample cases also increases, so the uniformity confidence eventually increases. Figure 9 shows the operation procedure of KSample and UC-KSample when the sampling ratio

Operation procedure of (a) KSample and (b) UC-KSample for
Second, we call the requirement for solving the past sample invariance the past sample change. The past sample change makes it possible to modify even the already sampled data. Allowing the replacement of already sampled data with other stream data increases the number of extractable sample cases, and thus the past sample change can eventually improve the uniformity confidence. As shown in Figure 9(a), after creating the second slot, KSample moves the element ① already selected in the first slot to the secondary storage and never changes it. In contrast, UC-KSample does not move ① to the secondary storage but lets it remain in main memory until certain requirements are met, and thus it can include ① in the next sampling population.
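The two requirements can be illustrated by modifying the sampling loop so that selected elements stay in memory and remain replaceable. This is a sketch of the idea only: the reservoir-style replacement rule and all names are our assumptions, not the paper's exact rule.

```python
import random

def uc_ksample_window(stream, ratio):
    """Sampling inside one window with both fixes applied: selected
    elements stay in memory (sample range extension) and any of them
    may later be replaced (past sample change)."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < ratio * n:
            sample.append(item)    # open a new slot
        else:
            # assumed replacement rule: reservoir-style over the whole
            # in-memory sample, so that past samples can still change
            j = random.randrange(n)
            if j < len(sample):
                sample[j] = item
    return sample
```

Compared with the KSample procedure, every in-memory slot, not just the current one, can still be overwritten, which is what increases the number of realizable sample cases.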
Third, we use the UC-window as the requirement for solving the sampling range increase which causes the continuous degradation. (If we ignore the user-specified threshold, that is, set
Lemma 1
Given the sampling ratio
In equation (2),
Proof
UC-KSample works as shown in Figure 10 when the sample size increases by one. When the input stream size up to date is

Working procedure of UC-KSample when the sample size increases by one.
Example 4
Figure 11 shows the operation procedure of UC-KSample when the sampling ratio

Operation procedure of UC-KSample when
Table 1 summarizes the two problems of KSample, their causes, and the requirements of UC-KSample for resolving those problems.
Problems and their causes in KSample and corresponding requirements for UC-KSample.
UC-KSample algorithm
We now present the algorithm and operation procedure of UC-KSample and give an intuitive example. Figure 12 shows the operation steps of UC-KSample, and the detailed explanation of each step is as follows:
Window size computation. We calculate the UC-window size that always satisfies the user-specified threshold
Stream input. The data source inputs a data stream to be sampled.
Comparison of window size and stream length. We compare the window size with the stream length that has occurred to date. If the window
Secondary storage. We store the samples generated in the current window to the secondary storage.
Window creation. We create a new window to maintain the uniformity confidence above the given threshold.
Sample size and sampling ratio verification. When a data stream is input, we check whether the current sample size is
Slot creation. We increment the sample size by creating a new slot, which is a single element memory space added to the sample.
Sampling. We perform sampling on the current slot. Specifically, if the probability that a data stream element selected in the slot

Operation procedure of UC-KSample.
Compared with KSample’s operation procedure in Figure 3, we can see that (1) window size computation, (3) comparison of window size and stream length, and (5) window creation are added to UC-KSample for use of UC-window.
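The added window steps can be sketched as an outer loop around the sampling procedure. In this illustrative sketch, window_size is taken as an input, since equation (2) computes it from the user-specified threshold; the in-window replacement rule and all names are our assumptions.

```python
import random

def uc_ksample(stream, ratio, window_size):
    """Outer loop of UC-KSample: sample within a UC-window; when the
    window is full, flush its samples to secondary storage and start
    a new window so that the uniformity confidence stays above the
    user-specified threshold."""
    archived = []      # samples flushed to secondary storage
    sample = []        # samples of the current UC-window
    n_in_window = 0    # stream elements seen in the current window
    for item in stream:
        if n_in_window >= window_size:     # window is full:
            archived.extend(sample)        # flush to secondary storage
            sample, n_in_window = [], 0    # create a new UC-window
        n_in_window += 1
        if len(sample) < ratio * n_in_window:
            sample.append(item)            # new slot
        else:
            j = random.randrange(n_in_window)
            if j < len(sample):            # past sample change
                sample[j] = item
    return archived + sample
```

Restarting the window resets the gap between the algorithm's sample cases and the statistically possible cases, which is how the continuous degradation is bounded.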
Figure 13 shows the proposed algorithm of UC-KSample. The inputs are the sampling ratio

UC-KSample algorithm.
Example 5
Figure 14 shows the operation procedure of UC-KSample when the sampling ratio

A sampling example of the operation procedure in UC-KSample when
Experimental evaluation
Experimental environment and data
We perform the experimental evaluation of KSample, Naive UC-KSample (UC-KSample with infinite window size), and UC-KSample. In the first experiment, we compare the uniformity confidence of UC-KSample with the other two algorithms by varying the sampling ratio and the user-specified threshold. In the second experiment, we evaluate the sampling accuracy instead of the uniformity confidence for different sampling ratios and thresholds. In the third and fourth experiments, we measure the secondary storage usage and execution time, respectively, by varying the sampling ratio.
For the accuracy test, we use three data sets: Gaussian data, temperature sensor data from nuclear power plants (Nuclear data in short), and Twitter hash tag data (Tweet data in short). Table 2 describes the three datasets used in the experiment. In Table 2, Gaussian data is used in various studies8,21 as the main experimental data for the stream environment. We use a Gaussian data generator that produces data items one by one according to the given mean and standard deviation and adds those items in real time to the input stream. The hardware platform used for the accuracy experiment is an HP workstation equipped with an Intel i7 3.60 GHz CPU and 8.0 GB RAM, and the software platform is the Ubuntu 16.04 LTS operating system.
Three datasets used in the experiment.
In a stream environment, the order of data input affects the sampling result.5 However, this issue applies to all sampling techniques; even random sampling may show different results depending on the order. Both KSample and the proposed UC-KSample inherit the characteristics of random sampling, and thus they also show different sampling results for different input orders. However, when the amount of input data is large, the input order has no significant effect on the probability distribution of the sampling results. Thus, we do not consider the input order in the experimental results.
Experimental results on uniformity confidence
Here, we measure and compare uniformity confidence of KSample, Naive UC-KSample, and UC-KSample with different sampling ratio

Uniformity confidence comparison when
Figure 16 shows the results when we use the same

Uniformity confidence comparison when
Figure 17 shows the results when we keep

Uniformity confidence comparison when
Figure 18 shows the relative results for various user-specified thresholds and sampling ratios. In Figure 18(a), we fix

Relative comparison of uniformity confidence by varying the user-specified threshold and the sampling ratio: (a) relative results for
Experimental results on sampling accuracy
We now evaluate the sampling accuracy of the three algorithms. The accuracy, the comparative measure of this experiment, means the similarity between the population and the sample. Because Gaussian data and Nuclear data are numerical data following a normal distribution, we use the z-statistic22 to measure the similarity. On the other hand, we can regard Tweet data as text documents, so we use the cosine similarity23,24 to evaluate the frequencies of hash tags. For each experimental case, we perform sampling
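For the Tweet data, comparing hash-tag frequency vectors by cosine similarity can be sketched as follows; the function name and the Counter-based representation are our own illustrative choices.

```python
import math
from collections import Counter

def cosine_similarity(population_tags, sample_tags):
    """Cosine similarity between the hash-tag frequency vectors of
    the population and the sample (1.0 means identical profiles)."""
    p, s = Counter(population_tags), Counter(sample_tags)
    dot = sum(p[tag] * s[tag] for tag in set(p) | set(s))
    norm = (math.sqrt(sum(v * v for v in p.values())) *
            math.sqrt(sum(v * v for v in s.values())))
    return dot / norm if norm else 0.0
```

A sample whose tag-frequency profile matches the population yields a similarity near 1.0, so higher values indicate higher sampling accuracy on text data.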
z-statistics results on Gaussian data.
Boldface indicates the z-statistic closest to 0.
Tables 4 and 5 show the z-statistics of Nuclear data and the cosine similarity of Tweet data, respectively. In both experiments, we change
z-statistics results on Nuclear data.
Cosine similarity results on Tweet data.
Experimental results on secondary storage and execution time
Here we measure and compare the secondary storage space required by the three algorithms. Table 6 shows the secondary storage space required by KSample, UC-KSample, and Naive UC-KSample, respectively, for Gaussian data. In the experiment, we use
Secondary storage required in KSample, Naive UC-KSample, and UC-KSample (KB, Gaussian data).
Now we measure and compare the execution times of the three algorithms. Figure 19 shows the execution times of KSample, UC-KSample, and Naive UC-KSample for Gaussian data. In the experiment, we set the sampling rate

Comparison of execution times of KSample, Naive UC-KSample, and UC-KSample (Gaussian data).
Conclusion
In this article, we proposed the UC-KSample algorithm, which improves the uniformity confidence of KSample in the stream environment of sensor data. We first derived the two problems of initial degradation and continuous degradation of uniformity confidence in KSample. We then identified the properties causing the initial degradation as the sample range limitation and the past sample invariance, and the property incurring the continuous degradation as the sampling range increase. Next, we presented the sample range extension, the past sample change, and the use of the UC-window as the requirements that alleviate these problems, and proposed UC-KSample, which applies these requirements to KSample. Experimental results showed that, compared with KSample, UC-KSample increased the uniformity confidence about
Footnotes
Handling Editor: Suat Ozdemir
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partly supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2016-0-00179, Development of an Intelligent Sampling and Filtering Techniques for Purifying Data Streams) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2017R1A2B4008991).
