Clustering on heterogeneous IoT information network based on meta path

Abstract

As the Internet and Internet of Things (IoT) continue to develop, Heterogeneous Information Networks (HIN) have formed complex interaction relationships among data objects. These relationships are represented by various types of edges (meta-paths) that contain rich semantic information. In the context of IoT data applications, the widespread adoption of Trigger-Action Patterns makes the management and analysis of heterogeneous data particularly important. This study proposes a meta-path-based clustering method for heterogeneous IoT data called I-RankClus, which aims to improve the modeling and analysis efficiency of IoT data. By combining ranking with clustering algorithms, the PageRank algorithm was used to calculate the intraclass influence of objects in the network. The HITS algorithm then transfers the influence to the core objects, thereby optimizing the classification of objects during the clustering process. The I-RankClus algorithm does not process each meta-path individually, but instead integrates multiple meta-paths to enhance the interpretability and clustering performance of the model. The experimental results show that the I-RankClus algorithm can process complex IoT datasets more effectively than traditional clustering methods and provide more accurate clustering outcomes. Furthermore, through a detailed analysis of meta-paths, this study explored the influence and importance of different meta-paths, thereby validating the effectiveness of the algorithm. Overall, the research presented in this paper not only improves the application effects of HINs in IoT data analysis but also provides valuable methods and insights for future network data processing.

Keywords

Heterogeneous information network, meta-path, Internet of Things, ranking, clustering

Introduction

The digital era is characterized by an explosive growth in data quantity and connectivity demands owing to advancements in Internet of Things (IoT) technologies.¹ However, despite these advancements, the integration of heterogeneous IoT systems remains a significant barrier, leading to inefficiencies and underutilization of data potential. This study is motivated by the need to overcome these integration challenges to facilitate seamless data flow and interpretation across various IoT platforms. By developing an improved semantic interoperability framework, this study aims to enable a more comprehensive and effective utilization of IoT systems, enhancing both technical outcomes and user experiences. With this motivation, our approach leverages advanced methodologies to address key gaps in current research and paves the way for practical IoT applications in diverse domains.

The proliferation of the IoT across various sectors produces a deluge of real-time data, realizing the vision articulated by Ashton² in 1999: a world in which every electronic object carries a unique digital identity and is interconnected with others. Defined in 2005 as an omnipresent and dynamic network,³ the IoT is now woven into daily life, with applications ranging from autonomous driving to smart homes. Notably, the Trigger-Action Pattern (TAP) has emerged as a distinctive IoT application⁴ that enables cross-domain interactions via user-customized Recipes, which combine the functionalities of disparate items. Platforms such as IFTTT⁵ and Integromat⁶ have simplified Recipe creation and usage. The rapid evolution of the IoT has produced a surfeit of sensors and intelligent devices, embedding complex data into daily routines and forming intricate systems in which specific resources are hard to find. Consequently, efficient and intelligent data mining methods are indispensable for extracting meaningful information from the vast, cluttered, and noisy datasets of the IoT.

As the application of the IoT and its TAP continue to expand, a vast quantity of TAP data is generated,⁷ encompassing complex data sourced from the real world and composed of various types of objects and interactions. Utilizing the network structure as a fundamental unit for data processing ensures data integrity, forming what is known as a Heterogeneous Information Network (HIN).^8,9 Heterogeneous Information Networks integrate multiple object types and their interactive data, effectively managing the complex data derived from TAP.¹⁰ In studying HINs, researchers have discovered that cluster analysis of these networks can uncover valuable structural information.^11,12 To enhance the trustworthiness of the clustering results, it is imperative to account for the interaction information between different object types, thus preventing the loss of data.

Traditional clustering methods rely mainly on the attribute features of objects, dividing networks into multiple subgraphs based on the features of adjacent objects and grouping similar objects into the same cluster. The Learning with Local and Global Consistency algorithm¹³ builds a similarity matrix over network objects, propagates class label information until a globally stable state is reached, and then clusters objects with high similarity. Spectral clustering^14,15 constructs a graph of network objects, assigns edge weights based on object distances, and partitions the graph so that weights between clusters are minimal and weights within a cluster are maximal. DBSCAN, a density-based clustering method,¹⁶ grows clusters outward from core objects whose neighborhoods are sufficiently dense and treats objects in low-density regions as noise, which makes it effective for extracting clusters of various shapes and sizes from noise-affected data.

As interest in HINs has grown owing to their complexity and the diverse object types they contain, traditional methods for analyzing homogeneous networks have become increasingly inadequate. These conventional approaches struggle to capture the rich structural nuances present in HINs, as they often fail to account for the varied and complex connections between different types of data. To address these challenges, methods such as Semantic-Path Nonnegative Matrix Factorization (SPNMF)¹⁷ have been developed. The SPNMF calculates object similarity within HINs using a similarity matrix for clustering and incorporates a reliability matrix to regularize the matrix decomposition process. However, while SPNMF introduces significant advancements, it does not fully adapt to the dynamic structures typical of IoT environments, which are a common context for HINs. Similarly, graph-based methods, such as the GN algorithm¹⁸ and its more efficient variant, the Fast GN algorithm¹⁹ by Newman, prioritize modularity and the optimization of intracluster connections. These algorithms focus on enhancing cluster cohesion and separation but can oversimplify complex intertype relationships. Such simplifications can lead to the loss of the crucial structural insights necessary for a deep understanding of HINs.

Sun et al. combined ranking algorithms with clustering so that each iteration of one improves the other. RankClus^20–22 integrates clustering with ranking, providing rational cluster results, while NetClus,²³ based on RankClus, treats networks as star-shaped and clusters central objects based on ranking. However, NetClus is limited to star-shaped HINs and requires prior knowledge of representative objects. To address these issues, the ENetClus algorithm²⁴ was proposed, which considers the relationships between same-type nodes in clustering. PopRank²⁵ ranks objects in HINs based on knowledge propagation and assigns popularity propagation factors to different connections. In HINs with sparse paths, link prediction²⁶ can be used to obtain effective path information.

Research on IoT data processing is gaining traction because the heterogeneity and interoperability challenges of IoT data complicate application development and data management; much of this work relies on technologies such as linked data²⁷ and ontologies.²⁸ Ganzha et al. achieved interoperability across multidomain IoT platforms, identifying ontologies useful for cross-domain development.²⁹ Chen et al. proposed a recommendation model that maps the social relations of items into low-dimensional spaces.³⁰ Noura et al. designed a semantic framework for automatically extracting key topics from IoT-related literature.³¹ Shakya et al. developed an IoT-based ontology model that enhances road safety by harnessing wireless networks for more effective traffic management.³² Elgazzar et al. examined the integral contribution of IoT to smart city infrastructures, addressing prominent challenges such as security and privacy.³³ Zhuang et al. pioneered the use of a particle swarm optimization algorithm to facilitate the integration of heterogeneous sensor data, thereby improving platform interoperability in the IoT landscape.³⁴

Recently, the TAP has been widely applied in IoT scenarios, allowing users to edit TAP rules (Recipes) based on their needs. Research on TAP primarily focuses on the syntactic analysis and application of Recipes: Liu et al. developed a neural network model that translates Recipes into operational programs using natural language constructs,³⁵ and other work has enhanced user interfaces to simplify Recipe creation.³⁶ Efforts by Corno et al.³⁷ and Jiang et al.³⁸ to annotate Recipes with semantic web technologies and detect rule chains within Recipes have significantly improved user interaction and system capabilities, yet the application of HIN models for extracting structural data from TAP remains underexplored. Meanwhile, notable advancements in HIN clustering methods by El-Kishky et al.,³⁹ Zhao et al.,⁴⁰ and Liu et al.⁴¹ have addressed personalized recommendation systems and drug–disease interaction predictions within HINs but face challenges with complex, sparse networks and have limited ability to exploit the semantic details of meta-paths. These developments underscore the need for more robust methods that can fully exploit the intricate and dynamic relationships within HINs, which motivates the clustering approach developed in this study.

In response to the growing complexity of HINs within the rapidly evolving IoT landscape, this study proposes a novel clustering method, I-RankClus. I-RankClus is designed to model and analyze the intricate structure of IoT data through the lens of a HIN, leveraging the distinctive TAP dataset. This approach significantly advances the field by employing path information within the HIN to refine the network's architecture, thus addressing a critical gap in existing methodologies. The primary objective of this study is to introduce and validate I-RankClus, which overcomes the limitations of existing approaches by utilizing a meta-path-based framework tailored for IoT data. This method aims to enhance clustering accuracy by integrating multiple meta-paths⁴¹ that capture the complex and dynamic relationships between different types of objects within HINs, improve the interpretability of clustering results by leveraging the unique properties of each meta-path, and offer a scalable solution that adapts dynamically to the evolving nature of data and its interactions within IoT environments.

I-RankClus sets itself apart using meta-paths as a structural foundation for calculating the influence of different object types within HINs. By integrating multiple meta-paths, it captures the nuanced structure of HINs more accurately than single meta-path models, thereby significantly enhancing traditional methods. This study further validates this approach through comprehensive experimental tests using the IoT TAP dataset, demonstrating the effectiveness of I-RankClus in clustering and partitioning heterogeneous IoT data into meaningful clusters, thus confirming its potential to profoundly impact the processing and analysis of such data.

Our contributions can be summarized as follows:

Introduction of I-RankClus, a pioneering clustering method tailored for heterogeneous IoT data, which leverages the unique structure of HINs.

Implementation of a novel approach that combines multiple meta-paths for influence calculation, providing a comprehensive representation of the network's structure.

Extensive experimental validation demonstrating the superior performance of our method in clustering heterogeneous IoT data, marking a significant advancement in the field.

The remainder of this paper is organized as follows: Methods details the methodology behind I-RankClus, including the theoretical foundation and mechanics of the algorithm. Results presents the experimental setup, data description, and the results of our comprehensive testing, highlighting the effectiveness of the method. Finally, Conclusions discusses the implications of our findings and suggests directions for future research.

Methods

To achieve optimal clustering outcomes and enhance the network structure in the IoT, this section begins by initializing the dataset and constructing a bipartite information network grounded on central-type and attribute-type objects.

Table 1 lists the key symbols used throughout the "Methods" section to clarify the descriptions of our network model:

Table 1.

List of symbols and their descriptions.

Symbol	Description
HIN	Heterogeneous Information Network
IoT	Internet of Things
TAP	Trigger-Action Pattern
I-RankClus	Proposed Clustering Method
S	Set of Services in the IoT network
C	Set of Channels in the IoT network
T	Set of Triggers in the IoT network
A	Set of Actions in the IoT network
M	Influence matrix in HIN
P	Meta-path in HIN
$S_{Rank}$	Influence ranking of Services
$C_{Rank}$	Influence ranking of Channels
SRS	Service-Recipe-Service Meta-path
SCS	Service-Channel-Service Meta-path
CSRSC	Channel-Service-Recipe-Service-Channel Meta-path
CSCSC	Channel-Service-Channel-Service-Channel Meta-path

Note: This table elucidates the symbols utilized to represent various elements of the bipartite information network structure, facilitating a clearer understanding of the network model described.

We computed the influence of both attribute-type and central-type objects using different meta-paths and integrated these single meta-paths into the influence ranking calculations, enabling us to obtain an influence ranking for the integrated meta-paths that encapsulates rich structural information. This ranking is then leveraged to refine the subsequent clustering efforts, with iterative computations performed until satisfactory cluster results are achieved.

Heterogeneous IoT information networks

Homogeneous information networks often fail to represent the complexities of network structures, prompting researchers to focus on HINs. An example of such a network is shown in Figure 1, which illustrates a HIN comprising three types of entities: conferences (V), authors (A), and papers (P). It includes two types of connections: papers presented at conferences (P-V) and papers authored by authors (A-P).

Figure 1.

Structure diagram of a heterogeneous information network.

Concept of HINs: Consider an information network $S = (A, R)$ comprising a set of entity types $A = \{A\}$ and a set of relation types $R = \{R\}$. Define a graph $G = (V, E)$, where V is a set of entities and E is a set of connections, with an entity mapping function $f : V \to A$ and a connection mapping function $g : E \to R$. Here, A and R represent the defined sets of entity types and connection types, respectively. If $|A| + |R| > 2$, the network is classified as a HIN.

Concept of dual-type information networks: Given two sets of entities of different types, $X = \{x_{1}, x_{2}, \dots, x_{m}\}$ and $Y = \{y_{1}, y_{2}, \dots, y_{n}\}$, if $V = X \cup Y$ and $E \subseteq V \times V$, then the graph $G = (V, E)$ is termed a dual-type information network based on types X and Y.

Meta-path framework

In HINs, diverse types of entities are interconnected through various linkages. These linkages form pathways, known as meta-paths, which encapsulate distinct structural data of the network and may be depicted as sequences of binary relations between pairs of entities. Within a specified network $S = (A, R)$ , a meta-path is represented as $A_{1} \overset{R_{1}}{\to} A_{2} \overset{R_{2}}{\to} \dots \overset{R_{i}}{\to} A_{i + 1}$ or abbreviated as $A_{1} A_{2} \dots A_{i + 1}$ . For example, in a scholarly network in which two authors, $A_{1}$ and $A_{2}$ , collaborate on paper P, their cooperative relationship is captured by the meta-path $A_{1} P A_{2}$ . Thus, meta-paths are critical in managing entity relationships within HINs and revealing the underlying structural insights of the network.
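The way a meta-path such as $A_{1} P A_{2}$ links objects can be illustrated by multiplying relation matrices along the path. The following is a minimal sketch, assuming a hypothetical 0/1 author-paper incidence matrix; the names `author_paper` and `metapath_counts` are illustrative and do not come from the original study.

```python
import numpy as np

# Hypothetical toy data: 3 authors (rows) x 2 papers (columns).
# author_paper[a, p] = 1 if author a wrote paper p.
author_paper = np.array([
    [1, 0],  # author 0 wrote paper 0
    [1, 1],  # author 1 wrote papers 0 and 1
    [0, 1],  # author 2 wrote paper 1
])


def metapath_counts(*relations):
    """Multiply relation matrices along a meta-path.

    Entry (i, j) of the result counts the path instances that start
    at object i and end at object j.
    """
    result = relations[0]
    for rel in relations[1:]:
        result = result @ rel
    return result


# A-P-A: follow author -> paper, then paper -> author (the transpose).
apa = metapath_counts(author_paper, author_paper.T)
print(apa)
```

Off-diagonal entries of `apa` count the papers two authors share; diagonal entries count each author's own papers.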

Furthermore, meta-paths connect diverse objects and encapsulate their informational relationships.^42,43 In HINs, two objects can be linked by multiple meta-paths, each carrying distinct semantics. For instance, while both SRS and SCS have attribute type S as their start and end points, the semantics they embody differ: SRS represents a complete process established via a Recipe, whereas SCS denotes a connection between two attributes of type S via the same Channel. On the IFTTT website, Recipes are crafted by users, so the SRS meta-path reflects the impact of user behavior on the network structure. Conversely, Channels are service interfaces offered by third parties, so the SCS meta-path reflects the influence of external providers on the network's architecture. Distinguishing these two sources of influence is valuable when determining how such connections shape the overall network.

Building on this understanding, our study explores how influence is quantified within a single meta-path. Within these structured paths in a HIN, substantial hidden information can be uncovered, thus aiding the computation of object influence.⁴⁴ Drawing inspiration from the PageRank algorithm,⁴⁵ in which entities linked to high-influence nodes gain influence themselves, our analysis folds objects connected via meta-paths into the influence calculation. For instance, within the SRS meta-path, if an attribute-type object is connected to another through a Recipe, the influence of the associated attribute objects is factored into the object's influence score, so that high-influence objects directly raise the influence of the objects they are connected to. For a given single meta-path P, the influence of attribute-type objects calculated based on P is denoted by $S_{p}Rank$. The $S_{p}Rank$ values are computed with the PageRank algorithm to rank the influence of attribute-type objects: the $S_{p}Rank$ value of each attribute-type object $S_{j}$ is divided by the number of attribute-type objects connected to it, the resulting share is distributed to each connected attribute-type object $S_{i}$, and the shares received by $S_{i}$ are summed to obtain $S_{p}Rank(S_{i})$.

The calculation of $S_{p}Rank$ is given in equation (1), which indicates that an attribute object's influence increases when it is connected to multiple attribute objects with high $S_{p}Rank$ values:

$$S_{p}Rank(S_{i}) = \sum_{j = 1,\, S_{j} \in N_{P}}^{n} \frac{S_{p}Rank(S_{j})}{T(S_{j})} \tag{1}$$

Here, $N_{P}$ represents the set of attribute-type objects connected to $S_{i}$ through meta-path P, $S_{p}Rank(S_{j})$ denotes the $S_{p}Rank$ value of attribute-type object $S_{j}$ connected to $S_{i}$, and $T(S_{j})$ represents the number of attribute-type objects connected to $S_{j}$.

To calculate the initial influence of attribute-type objects, we used the ratio of the number of meta-paths connected to the object to the total number of meta-paths in the entire network as the object's initial influence. The calculation method is as follows:

$$Rank(S_{i}) = \sum_{j = 1,\, S_{j} \in G}^{n} \frac{N_{P}(S_{i})}{N_{P}(S_{j})} \tag{2}$$

In this equation, G denotes the subgraph to which object $S_{i}$ belongs, and $N_{P}(S_{i})$ signifies the number of meta-paths connected to $S_{i}$ under the current meta-path P.

However, because some objects in the subgraph have no connecting paths to other objects, equation (2) can yield zero influence for them. Consequently, this study introduces a resistance coefficient d (analogous to the damping factor in PageRank) to smooth the influence calculation:

$$S_{p}Rank(S_{i}) = \frac{1 - d}{Z} Rank(S_{i}) + d \sum_{j = 1,\, S_{j} \in N_{P}}^{n} \frac{S_{p}Rank(S_{j})}{T(S_{j})} \tag{3}$$

In this equation, Z represents the total number of attribute-type objects, and $Rank(S_{i})$ is the initial influence of attribute-type object $S_{i}$.
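The iteration implied by equation (3) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a symmetric 0/1 adjacency matrix for a single meta-path, takes the initial influence as each object's share of all connections (following the verbal description of the initial influence), and uses d = 0.85, the conventional PageRank value, which the source does not specify. The function name `sp_rank` is hypothetical.

```python
import numpy as np


def sp_rank(adj, d=0.85, tol=1e-10, max_iter=200):
    """Iterate equation (3) for one meta-path.

    adj[i, j] = 1 if services S_i and S_j are connected through the
    meta-path (assumed symmetric). d is the resistance coefficient;
    0.85 is an assumed default, not a value from the source.
    """
    adj = np.asarray(adj, dtype=float)
    z = adj.shape[0]                      # Z: number of attribute-type objects
    degree = adj.sum(axis=1)              # T(S_j): objects connected to S_j
    total = degree.sum()
    # Assumed initial influence: share of all connections held by S_i.
    init = degree / total if total > 0 else np.full(z, 1.0 / z)
    safe_degree = np.where(degree > 0, degree, 1.0)  # avoid division by zero

    rank = init.copy()
    for _ in range(max_iter):
        # Each S_j passes rank_j / T(S_j) to every neighbour S_i.
        received = adj @ (rank / safe_degree)
        new_rank = (1.0 - d) / z * init + d * received
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy example: three services connected in a line, S0 - S1 - S2.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
ranks = sp_rank(path_graph)
print(ranks)
```

In this toy graph the middle service, which is connected to both others, receives the largest score.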

Influence calculation based on the integration of multiple meta-paths

Given the diversity of meta-paths in HINs,⁴⁶ relying solely on a single meta-path for calculations can lead to the omission of an object's structural information on other meta-paths, resulting in incomplete information. To avoid this issue, this study employed a linear weighting method⁴⁷ to integrate multiple single meta-paths for influence calculation. The equation for calculating the influence based on the integration of multiple meta-paths is represented as:

$$SRank(S_{i}) = \sum_{p_{i} \in P} w_{p_{i}} \cdot S_{p_{i}}Rank(S_{i}) \tag{4}$$

In equation (4), $p_{i}$ indicates the selected single meta-paths, $w_{p_{i}}$ is the path weight, and $S_{p_{i}}Rank(S_{i})$ represents the influence calculated using the single meta-path $p_{i}$.

The magnitude of the path weights reflects the importance of the current meta-path in the calculation of influence, with the sum of all path weights equal to one, that is, $\sum_{p_{i} \in P} w_{p_{i}} = 1$.

When calculating the influence of attribute-type object S, individual meta-paths such as SRS and SCS can be selected for weighted integration SRS + SCS, maintaining the influence exerted by both users and third parties on the calculation. The choice of different path weights $w_{p_{i}}$ also affects the final result of the influence calculation.
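The weighted integration SRS + SCS can be sketched in a few lines. The rank vectors and weights below are made-up illustrative values, and `integrate_ranks` is a hypothetical helper name; the only property taken from the text is that the path weights must sum to one.

```python
import numpy as np


def integrate_ranks(ranks_by_path, weights):
    """Linear weighting of per-meta-path influence vectors.

    ranks_by_path: dict mapping a meta-path name to a rank vector.
    weights:       dict mapping the same names to weights that sum to 1.
    """
    if set(ranks_by_path) != set(weights):
        raise ValueError("every meta-path needs exactly one weight")
    if not np.isclose(sum(weights.values()), 1.0):
        raise ValueError("path weights must sum to 1")
    return sum(weights[p] * np.asarray(ranks_by_path[p], dtype=float)
               for p in ranks_by_path)


# Illustrative (made-up) per-path influence of three services.
ranks_by_path = {
    "SRS": [0.5, 0.3, 0.2],  # influence driven by user-created Recipes
    "SCS": [0.2, 0.2, 0.6],  # influence driven by shared Channels
}
s_rank = integrate_ranks(ranks_by_path, {"SRS": 0.7, "SCS": 0.3})
print(s_rank)
```

Raising the SRS weight shifts the integrated ranking toward services that are prominent in user-created Recipes, and lowering it favors the provider-defined Channel structure.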

In the selected dataset, a central-type object consists of multiple attribute-type objects. Attribute-type objects S can connect through the meta-paths SRS and SCS, and because of the inclusion relationship between attribute-type objects S and central-type objects C, extending these meta-paths yields the central-type object's meta-paths, CSRSC and CSCSC. Consequently, the integrated meta-path CSRSC + CSCSC can be derived.
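Extending a service-level meta-path to the channel level can be illustrated with matrix products: a Channel reaches its Services through the membership matrix, follows the service-level path, and returns to a Channel. The sketch below assumes a hypothetical 0/1 membership matrix `m_cs` and a symmetric service-to-service count matrix `srs`; the helper name `lift_to_channels` is illustrative and is not taken from the source.

```python
import numpy as np


def lift_to_channels(m_cs, s_to_s):
    """Extend a service-level meta-path (e.g. SRS) to the channel level
    (e.g. CSRSC) by prepending C->S and appending S->C."""
    m_cs = np.asarray(m_cs)
    return m_cs @ np.asarray(s_to_s) @ m_cs.T


# Hypothetical toy network: channel 0 owns services 0 and 1,
# channel 1 owns service 2.
m_cs = np.array([[1, 1, 0],
                 [0, 0, 1]])

# Two hypothetical Recipes: (service 0 -> service 2) and
# (service 1 -> service 2), stored symmetrically as SRS path counts.
srs = np.array([[0, 0, 1],
                [0, 0, 1],
                [1, 1, 0]])

csrsc = lift_to_channels(m_cs, srs)
print(csrsc)
```

Here the two channels are linked by two CSRSC path instances, one for each Recipe that crosses between them.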

Upon constructing a HIN G based on central and attribute-type objects, a matrix M is established to represent the relationships between objects within network G, which is expressed as follows:

$$M = \begin{pmatrix} M_{CC} & M_{CS} \\ M_{SC} & M_{SS} \end{pmatrix} \tag{5}$$

$M_{CC}$ denotes the connections among the central objects. According to the network structure, the values $m_{CC}$ within matrix $M_{CC}$ represent the number of connections between two central objects through the integrated meta-path CSRSC + CSCSC; $M_{CS}$ indicates the connections between central and attribute-type objects, and $M_{CS}$ and $M_{SC}$ are transposes of each other; $M_{SS}$ represents the connections among attribute-type objects. Because the connections between attribute objects are already considered when calculating their influence, this matrix is omitted.

To fully integrate the information of the central and attribute-type objects in the calculation of the central object influence, two rules are defined:

Rule 1: The influence of each central-type object is determined by both the number and the influence of its connected attribute-type objects. The more attribute-type objects are connected to a central-type object, the greater its influence. Similarly, the higher the influence of the attribute-type objects, the greater the influence they confer on their connected central-type objects:

$$CRank(C_{i}) = \sum_{j = 1}^{|S|} M_{SC}(S_{j}, C_{i}) \cdot SRank(S_{j}) \tag{6}$$

When a central-type object is connected to multiple high-influence attribute-type objects, each connection weight $M_{SC}(S_{j}, C_{i})$ is multiplied by the influence of the corresponding attribute-type object $S_{j}$, and the products are accumulated into the influence of the central-type object $C_{i}$.

Rule 2: If a central-type object is connected to other high-influence central-type objects through the integrated meta-path CSRSC + CSCSC, the influence of this central-type object will be enhanced.

According to Rule 2, equation (6) is extended as follows:

$$\begin{aligned} CRank(C_{i}) = {} & \alpha \sum_{j = 1}^{|S|} M_{SC}(S_{j}, C_{i}) \cdot SRank(S_{j}) \\ & + (1 - \alpha) \sum_{j = 1}^{|C|} M_{CC}(C_{i}, C_{j}) \cdot CRank(C_{j}) \end{aligned} \tag{7}$$

Here, the parameter $\alpha \in [0, 1]$ determines the weight assigned to each rule in the calculation, which can be derived from prior knowledge or the training dataset.

When a central-type object connects to other high-influence central-type objects, its influence rises due to these high-influence connections.

For subsequent calculations, the resulting influence is normalized:

$$CRank(C_{i}) \leftarrow \frac{CRank(C_{i})}{\sum_{i' = 1}^{|C|} CRank(C_{i'})} \tag{8}$$

Furthermore, the influence of attribute-type objects can be transferred to central-type objects under specific operational rules, such as Rule 1, allowing the dissemination of structural information contained within attribute-type objects to central-type entities. As these central-type objects can exert influence on each other via meta-paths, the calculation of their influence incorporates the structural data of both attribute-type and central-type objects. This integration of influence across different object types through meta-paths reveals the underlying structural insights and interdependencies within the network.
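Rules 1 and 2 together with the normalization step can be sketched as a short fixed-point iteration. This is an illustrative reading of the update, not the authors' code: the function name `c_rank`, the fixed iteration count, the choice $\alpha = 0.5$, and the toy matrices are all assumptions, and the sketch presumes at least one Channel is connected to a Service with nonzero influence.

```python
import numpy as np


def c_rank(m_sc, m_cc, s_rank, alpha=0.5, n_iter=50):
    """Channel influence from service influence (Rule 1) and from
    other channels (Rule 2), normalised after every update.

    m_sc:   (|S|, |C|) service-to-channel weights
    m_cc:   (|C|, |C|) channel-to-channel meta-path counts
    s_rank: (|S|,) integrated service influence
    """
    m_sc = np.asarray(m_sc, dtype=float)
    m_cc = np.asarray(m_cc, dtype=float)
    s_rank = np.asarray(s_rank, dtype=float)

    # Rule 1: accumulate the influence of connected services.
    from_services = m_sc.T @ s_rank
    rank = from_services / from_services.sum()

    for _ in range(n_iter):
        # Rule 2: blend in the influence of connected channels.
        rank = alpha * from_services + (1.0 - alpha) * (m_cc @ rank)
        rank = rank / rank.sum()          # normalisation step
    return rank


# Hypothetical toy inputs: channel 0 owns services 0 and 1,
# channel 1 owns service 2; the channels share two meta-path instances.
m_sc = [[1, 0], [1, 0], [0, 1]]
m_cc = [[0, 2], [2, 0]]
ranks = c_rank(m_sc, m_cc, s_rank=[0.4, 0.4, 0.2])
print(ranks)
```

With $\alpha = 1$ the result reduces to the normalized Rule 1 scores; smaller values let a Channel borrow influence from the Channels it is connected to.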

Trigger-Action patterns in IoT

The TAP is an application-oriented programming paradigm currently employed in the IoT whereby users specify objectives using rules, obviating the need to write complex program code. This allows individuals without programming expertise to use IoT services effectively. Platforms such as IFTTT⁴⁸ and Zapier⁶ have demonstrated the viability of TAP,⁴⁹ and the data utilized in this study originated from the IFTTT website.

IFTTT lets users build workflows, known as Recipes, from the service interfaces it provides. A Trigger specifies the condition that starts a Recipe, and an Action is the predefined task executed automatically once the user-specified Trigger is satisfied. Triggers and Actions rely on services published on IFTTT by other websites, referred to as Channels. To create a Recipe, a user selects a Channel, chooses one of its services to act as the Trigger, and similarly specifies an Action from the available Channels. Users can edit Recipes in natural language, with the Trigger, Action, and Recipe combining to form a complete task.

When creating a task, users must select a Channel to access the necessary services. However, they may not know which Channel contains the required services. Owing to the inherent ambiguity of natural language, search attempts often do not yield satisfactory results. Furthermore, as the number of Channels and similar service offerings increases, query results may become confusing, increasing the cost of search.⁵⁰ Consequently, clustering Channels can reduce search costs. All Channels on the IFTTT website are provided by third parties, and manually annotating data is labor-intensive and fails to effectively reflect the network's structural information. Thus, an appropriate clustering method is required to obtain the structural information of the network. This study presents a clustering approach for IoT TAP datasets that can economize network processing and aid in optimizing the network structure.

Research objectives and data model

Following the discussion on the impact of meta-paths in revealing hidden relationships and structures within the network, we elaborate on the specific network structure adopted for our study. This study constructs a bipartite information network structure based on experimental datasets from the IFTTT IoT platform. The defined bipartite network structure $G = (V, E, W)$ categorizes node types into central types and attribute types. The central type is $\{Channel\}$, represented by the symbol C, which contains $|C|$ objects. The attribute types are $\{Trigger, Action\}$, represented by T and A, respectively, and they comprise $|T|$ and $|A|$ objects. Here, $C = \{c_{1}, c_{2}, \dots, c_{|C|}\}$, $T = \{t_{1}, t_{2}, \dots, t_{|T|}\}$, and $A = \{a_{1}, a_{2}, \dots, a_{|A|}\}$. Each Trigger and each Action belongs to a Channel, and a Channel may contain one or more Triggers and Actions. Owing to the similar attributes and network structure of Triggers and Actions, they are consolidated into a single attribute type, Service, denoted by S, with $|S| = |T| + |A|$. The set M denotes the weighting between the central and attribute-type objects, where $M_{ij} \in M$ represents the weight of the edge $(x_{i}, x_{j})$. It is defined as follows: if one of $x_{i}$ and $x_{j}$ belongs to S and the other belongs to C, and there is an edge between $x_{i}$ and $x_{j}$, then $M_{ij} = 1$; otherwise, $M_{ij} = 0$.
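The 0/1 weighting between Channels and Services defined above can be assembled from a simple membership listing. The sketch below uses made-up channel and service names (not taken from the IFTTT dataset) and a hypothetical helper `build_m_cs`; it assumes service names are unique across Channels.

```python
import numpy as np

# Hypothetical toy data: each Channel lists the Services
# (Triggers and Actions merged) that it contains.
channel_services = {
    "mail": ["new_email", "send_email"],
    "social": ["new_post", "publish_post"],
    "weather": ["rain_tomorrow"],
}


def build_m_cs(channel_services):
    """Return (channels, services, m_cs), where m_cs[i, j] = 1 if
    service j belongs to channel i and 0 otherwise."""
    channels = sorted(channel_services)
    services = sorted(s for svcs in channel_services.values() for s in svcs)
    s_index = {name: j for j, name in enumerate(services)}

    m_cs = np.zeros((len(channels), len(services)), dtype=int)
    for i, channel in enumerate(channels):
        for service in channel_services[channel]:
            m_cs[i, s_index[service]] = 1
    return channels, services, m_cs


channels, services, m_cs = build_m_cs(channel_services)
print(channels)
print(services)
print(m_cs)
```

Because every Service belongs to exactly one Channel, each column of `m_cs` contains a single 1; its transpose plays the role of the Service-to-Channel block of M.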

Figure 2 illustrates the constructed bipartite network, highlighting the connections between central type objects (Channels) and attribute type objects (Triggers and Actions), which are unified under the Services category. This visual representation clarifies the interaction framework within our bipartite structure, setting the stage for computational analysis that follows.

Figure 2.

CS bipartite heterogeneous information network.

Building on this structured network model, we introduced the I-RankClus algorithm through a detailed pseudocode representation. The purpose is to encapsulate our computational framework and outline the algorithmic steps that are essential for executing the proposed meta-path-based clustering method on heterogeneous IoT data networks (Table 2).

Table 2.

Pseudocode for I-RankClus.

Algorithm: I-RankClus

Input:

Network $G = ⟨ C, S; M ⟩$

Ranking function f

Number of clusters K

Output:

K clusters of $C, C_{i}$

Conditional ranking functions based on $C_{i} : {\vec{r}}_{C_{i} | C_{i}}, {\vec{r}}_{S | C_{i}}$ for i = 1, 2, …, K

Procedure:

Step 1: Initialization:

t = 0

Obtain an initial partition $\{ C_{i}^{(t)} \}_{i = 1}^{K}$ of C

Step 2: Conditional ranking for each cluster:

For i = 1 to K:

Generate subgraph $G_{i}^{(t)}$ using $C_{i}^{(t)}$ and S

$({\vec{r}}_{C_{i} | C_{i}}^{(t)}, {\vec{r}}_{S | C_{i}}^{(t)}) = f (G_{i}^{(t)})$

${\vec{r}}_{C | C_{i}}^{(t)} = M_{C S} {\vec{r}}_{S | C_{i}}^{(t)}$

End

Step 3: Compute mixed probability model:

Estimate parameters $Θ$ using the mixed probability model, obtain probability vector ${\vec{s}}_{c_{i}} = (β_{i, 1}, β_{i, 2}, \dots, β_{i, K})$ for each object $c_{i}$, and compute the centroid for each cluster:

For k = 1 to K:

${\vec{s}}_{C_{k}}^{(t)}$ = centroid of cluster $C_{k}^{(t)}$

End

Step 4: Adjust clustering:

For each object $c_{i}$ in C:

For k = 1 to K:

Compute the distance $D (c_{i}, C_{k}^{(t)})$ from $c_{i}$ to the centroid of cluster $C_{k}^{(t)}$

End

Assign $c_{i}$ to cluster $C_{k_{m}}^{(t + 1)}$ , where $k_{m} = a r g m i n_{k} D (c_{i}, C_{k}^{(t)})$

Step 5: Repeat steps 2, 3, and 4 until convergence

End Procedure

This pseudocode outlines the core algorithmic steps undertaken by the I-RankClus method to cluster heterogeneous IoT data. By leveraging the influence scores calculated from the meta-paths, the algorithm effectively partitions the network into meaningful clusters that reflect the complex, underlying structure of the IoT data.
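Step 2 of the pseudocode can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions: the ranking function f is replaced by a simple degree-based stand-in rather than the PageRank/HITS-based influence used in this paper, and the function name `rank_within_cluster` and the toy matrix are hypothetical.

```python
import numpy as np

def rank_within_cluster(M_CS, members):
    """Degree-based stand-in for f on the subgraph induced by one cluster.

    M_CS    : (|C|, |S|) matrix of edge counts between Channels and Services.
    members : indices of the Channels that belong to the cluster.
    Returns the intra-class ranking of the member Channels and the
    conditional ranking of all Services (both sum to 1).
    """
    M_sub = M_CS[members, :]              # rows of the induced subgraph
    total = M_sub.sum()
    if total == 0:
        raise ValueError("cluster has no edges to rank")
    r_C_sub = M_sub.sum(axis=1) / total   # score of each member Channel
    r_S = M_sub.sum(axis=0) / total       # score of each Service given the cluster
    return r_C_sub, r_S

# Toy network: 3 Channels x 3 Services (hypothetical counts).
M_CS = np.array([[2, 0, 1],
                 [0, 3, 0],
                 [1, 1, 0]], dtype=float)
r_C_sub, r_S = rank_within_cluster(M_CS, [0, 2])
print(r_C_sub, r_S)
```

Any other ranking function with the same inputs and outputs could be substituted here without changing the rest of the loop.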

Following the pseudocode, it is crucial to understand the computational demands of the algorithm, especially as the scale of the data increases. The complexity analysis below outlines the time complexity involved in each phase of the algorithm, providing insights into the scalability and efficiency of I-RankClus when applied to large-scale IoT networks:

Ranking Phase: The time complexity for the ranking component is $O (t_{1} | ε |)$ , where $| ε |$ denotes the number of connections, and $t_{1}$ is the number of iterations required for convergence.

Mixed Model Estimation: This phase involves calculating the conditional probabilities for each link within the clusters, leading to a time complexity of $O (K | ε |)$ , where K is the number of clusters.

Clustering Adjustment: Calculating the distance between each entity $c_{i}$ and each cluster $C_{k}$ , with each entity represented in a K-dimensional space, results in a time complexity of $O (m K^{2})$ , where m is the number of entities.

Combining these components, the overall time complexity of the I-RankClus algorithm across all iterations is $O (t (t_{1} | ε | + t_{2} (K | ε |) + m K^{2}))$, where t is the total number of iterations for the algorithm and $t_{2}$ is the number of iterations for the mixed model estimation. In sparse networks, this complexity approximates a linear relationship with the number of entities, indicating an efficient scalability as the network grows.

Ranking and clustering methodology

Previous steps computed the influence of attribute-type and central-type objects based on meta-path information, successfully identifying key nodes with substantial influence. This foundation allows for the next phase of our methodology, in which we cluster these influential central-type nodes. Within each cluster, we employ the previously computed influence scores to establish ranking mechanisms—specifically intraclass ranking and conditional ranking. These concepts facilitate a layered analysis of node significance, thereby enhancing the granularity and relevance of our clustering outcomes.

Given the graph $G = (C, S)$, for a subclass $C^{'} \subseteq C$, we define the subgraph $G^{'} = ⟨ C^{'} \cup S, M^{'} ⟩$ as induced from G. The conditional ranking for S, denoted as ${\vec{r}}_{S | C^{'}}$, and the intra-class ranking for $C^{'}$, denoted as ${\vec{r}}_{C^{'} | C^{'}}$, are defined as the ranking outcome of function f on subgraph $G^{'}$: $({\vec{r}}_{C^{'} | C^{'}}, {\vec{r}}_{S | C^{'}}) = f (G^{'})$. The conditional ranking on graph G, denoted as ${\vec{r}}_{C | C^{'}}$, is defined by the transferred scores of ${\vec{r}}_{S | C^{'}}$ on G as follows:

[{\vec{r}}_{C | C^{'}} (c_{i}) = \frac{\sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j}) {\vec{r}}_{S | C^{'}} (s_{j})}{\sum_{l = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{l}, s_{j}) {\vec{r}}_{S | C^{'}} (s_{j})}]

(9)
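The score transfer in equation (9) amounts to a matrix-vector product followed by a normalization. The sketch below shows this under stated assumptions: `conditional_rank_C` is a hypothetical helper name, the toy values are invented for illustration, and the uniform fallback for an all-zero denominator is our own choice rather than part of the method.

```python
import numpy as np

def conditional_rank_C(M_CS, r_S):
    """Transfer Service scores r_S (length |S|) to every Channel, as in Eq. (9)."""
    scores = M_CS @ r_S          # numerator: sum_j M_CS(c_i, s_j) * r_S(s_j)
    total = scores.sum()         # denominator: the same sum taken over all Channels
    if total == 0:               # assumption: fall back to a uniform ranking
        return np.full(M_CS.shape[0], 1.0 / M_CS.shape[0])
    return scores / total

# Toy example: 3 Channels x 3 Services (hypothetical counts and scores).
M_CS = np.array([[2, 0, 1],
                 [0, 3, 0],
                 [1, 1, 0]], dtype=float)
r_S = np.array([0.5, 0.3, 0.2])
r_C = conditional_rank_C(M_CS, r_S)
print(r_C)
```

Because the denominator sums over all Channels in C, Channels outside the cluster $C^{'}$ also receive a score, which is what allows objects to move between clusters later.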

Assuming that an initial partition of the central-type object collection C is known and that the conditional ranking scores for attribute-type and central-type objects within each partition have been computed, the subsequent goal is to leverage these conditional ranking scores to enhance the clustering outcomes. When clusters are appropriately delineated, the ranking scores of objects within a cluster should be distinct from those of other clusters, indicating that these scores can define new features, leading to improved clustering. In practical applications, for each cluster $C^{'}$, the conditional ranking scores ${\vec{r}}_{C^{'} | C^{'}}$ for C and ${\vec{r}}_{S | C^{'}}$ for S can be considered as the characteristic features of cluster $C^{'}$.

With the clustering results of the central-type objects established as $C_{1}, C_{2}, \dots, C_{K}$ and the conditional ranking scores computed for each cluster as ${\vec{r}}_{S | C_{k}}$ and ${\vec{r}}_{C | C_{k}}$ for $k = 1, 2, \dots, K$, we denote them as follows:

[p_{k} (C) = {\vec{r}}_{C | C_{k}}, p_{k} (S) = {\vec{r}}_{S | C_{k}}]

(10)

For every object $c_{i}$ in C $(i = 1, 2, \dots, C)$, we define $p_{c_{i}} (S) = p (S | c_{i})$ as the probability of $c_{i}$ connecting to an object in S via an edge. Subsequently, we establish a hybrid probability model that encompasses K distributions. Letting $β_{i, k}$ represent the posterior probability of $c_{i}$ belonging to the $k^{t h}$ class, we model $p (S | c_{i})$ as:

[p_{c_{i}} (S) = \sum_{k = 1}^{K} β_{i, k} p_{k} (S), \quad \sum_{k = 1}^{K} β_{i, k} = 1]

(11)

Because $β_{i, k}$ is the posterior probability of $c_{i}$ belonging to the $k^{t h}$ class, it can be inferred by Bayes’ theorem⁵¹: $p (k | c_{i}) \propto p (c_{i} | k) p (k)$, with $p (c_{i} | k)$ already computed as the conditional ranking score of $c_{i}$ in class k. Hence, the next goal is to estimate the prior probability $p (k)$.

Let $Θ$ represent the parameters of the hybrid probability model, where $Θ$ is a $C \times K$ matrix: $Θ_{C \times K} = {β_{i, k}} (i = 1, 2, \dots, C; k = 1, 2, \dots, K)$ . The next step involves estimating parameters $Θ$ based on the observed network values, for which we employ the EM algorithm:

(1) E Step

We introduce a latent variable $z \in \{ 1, 2, \dots, K \}$ that represents the category membership of each edge $⟨ c, s ⟩$. Here, $M_{C S}$ is a matrix in which each entry $(c_{i}, s_{j})$ is the number of observations of edge $⟨ c_{i}, s_{j} ⟩$. The latent class membership z influences the generation of these edges under the model parameters $Θ$.

The log-likelihood function $\log L (Θ | M_{C S}, Z)$ can be expressed across all observed data points as follows:

[\log L (Θ | M_{C S}, Z) = \log \prod_{i = 1}^{C} \prod_{j = 1}^{S} {p (c_{i}, s_{j}, z | Θ)}^{M_{C S} (c_{i}, s_{j})}]

(12)

Expanding the joint probability $p (c_{i}, s_{j}, z)$ using the chain rule of probabilities:

[p (c_{i}, s_{j}, z | Θ) = p (z | Θ) p (c_{i}, s_{j} | z, Θ)]

(13)

Assuming independence between $c_{i}$ and $s_{j}$ given z, we can factorize $p (c_{i}, s_{j} | z, Θ)$ as:

[p (c_{i}, s_{j} | z, Θ) = p_{z} (c_{i} | Θ) p_{z} (s_{j} | Θ)]

(14)

Thus, substituting back into the log-likelihood function,

[\log L (Θ | M_{C S}, Z) = \sum_{i = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j}) \log (p_{z} (c_{i} | Θ) p_{z} (s_{j} | Θ) p (z | Θ))]

(15)

This breaks down further to:

[\log L (Θ | M_{C S}, Z) = \sum_{i = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j}) [\log p_{z} (c_{i} | Θ) + \log p_{z} (s_{j} | Θ) + \log p (z | Θ)]]

(16)

where $p_{z} (c_{i} | Θ)$ and $p_{z} (s_{j} | Θ)$ represent the probabilities of $c_{i}$ and $s_{j}$ being generated from class z under the parameters $Θ$. Assuming that the initial parameter values $Θ^{0}$ follow a uniform distribution with $β_{i, k} = \frac{1}{K}$, we can express the expected value of the log-likelihood function as:

[Q (Θ, Θ^{0}) = E_{f (Z | M_{C S}, Θ^{0})} (\log L (Θ | M_{C S}, Z)) = \sum_{k = 1}^{K} \sum_{i = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j}) \log (p_{k} (c_{i}, s_{j}) p (z = k | Θ)) p (z = k | c_{i}, s_{j}, Θ^{0})]

(17)

Breaking it down further gives:

[Q (Θ, Θ^{0}) = \sum_{k = 1}^{K} \sum_{i = 1}^{C} \sum_{j = 1}^{S} [M_{C S} (c_{i}, s_{j}) \log (p (z = k | Θ)) p (z = k | c_{i}, s_{j}, Θ^{0}) + M_{C S} (c_{i}, s_{j}) \log (p_{k} (c_{i}, s_{j})) p (z = k | c_{i}, s_{j}, Θ^{0})]]

(18)

Here, $M_{C S} (c_{i}, s_{j})$ is the number of observations of edge $⟨ c_i, s_j ⟩$ . We separately calculate the probability of class k influencing this edge and the probability of the edge parameters themselves, both factored by how likely it is for the edge to belong to class k under the initial parameter assumptions $Θ^{0}$ . The conditional distribution in the aforementioned equation, $p (z = k | c_{i}, s_{j}, Θ^{0})$ , can be computed using Bayes’ theorem:

[p (z = k | c_{i}, s_{j}, Θ^{0}) \propto p (c_{i}, s_{j} | z = k, Θ^{0}) p (z = k | Θ^{0})]

(19)

With the uniform distribution assumption for $Θ^{0}$ , the prior probability $p (z = k | Θ^{0})$ is $\frac{1}{K}$ . Assuming independence within class k, we have:

[p (c_{i}, s_{j} | z = k, Θ^{0}) = p_{k}^{0} (c_{i}) p_{k}^{0} (s_{j})]

(20)

Substituting back:

[p (z = k | c_{i}, s_{j}, Θ^{0}) \propto p_{k}^{0} (c_{i}) p_{k}^{0} (s_{j}) \frac{1}{K}]

(21)

This equation can be interpreted as:

[p (z = k | c_{i}, s_{j}, Θ^{0}) \propto p_{k}^{0} (c_{i}) p_{k}^{0} (s_{j}) p^{0} (z = k)]

(22)

where $p^{0} (z = k) = \frac{1}{K}$ . Here, $p_{k}^{0} (c_{i})$ and $p_{k}^{0} (s_{j})$ represent the probabilities of $c_{i}$ and $s_{j}$ within class k, derived from the initial parameters $Θ^{0}$ . The term $p^{0} (z = k)$ represents the uniform prior probability of class k. This completes the derivation of the proportional relationship for $p (z = k | c_{i}, s_{j}, Θ^{0})$ .

(2) M-step

In the M-step of the Expectation-Maximization algorithm, the main objective is to maximize the auxiliary function $Q (Θ, Θ^{0})$, which represents the expected value of the log-likelihood and assists in determining the distribution of $p (z = k)$. This optimization is achieved by introducing a Lagrange multiplier⁵² to enforce the constraint that the probabilities must sum to one:

[\frac{\partial}{\partial p (z = k)} [Q (Θ, Θ^{0}) + λ (\sum_{k = 1}^{K} p (z = k) - 1)] = 0]

(23)

By setting the derivative of the auxiliary function with respect to $p (z = k)$ to zero and rearranging the terms, we arrive at equation (24):

[\sum_{i = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j}) \frac{1}{p (z = k)} p (z = k | c_{i}, s_{j}, Θ^{0}) + λ = 0]

(24)

This equation provides a way to iterate toward an optimal value of $p (z = k)$ that maximizes the expected log-likelihood under the current parameter estimates. Upon computation, it is feasible to ascertain new estimates for $p (z = k)$ given the initial values $Θ^{0}$ :

[p (z = k) = \frac{\sum_{i = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j}) p (z = k | c_{i}, s_{j}, Θ^{0})}{\sum_{i = 1}^{C} \sum_{j = 1}^{S} M_{C S} (c_{i}, s_{j})}]

(25)

Finally, we applied Bayes’ theorem to calculate the parameter $β_{i, k}$ within $Θ$ :

[β_{i, k} = p (z = k | c_{i}) = \frac{p_{k} (c_{i}) p (z = k)}{\sum_{l = 1}^{K} p_{l} (c_{i}) p (z = l)}]

(26)

The computation process described above is reiterated until the parameter matrix $Θ$ converges.
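The E-step and M-step above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `estimate_beta`, the dense three-dimensional array, the stopping tolerance, and the toy inputs are all assumptions, and the class-conditional scores $p_{k} (c_{i})$ and $p_{k} (s_{j})$ are taken as given.

```python
import numpy as np

def estimate_beta(M_CS, P_C, P_S, n_iter=50, tol=1e-8):
    """EM estimate of p(z = k) and beta[i, k] = p(z = k | c_i).

    M_CS : (|C|, |S|) edge counts.
    P_C  : (K, |C|) with P_C[k, i] = p_k(c_i).
    P_S  : (K, |S|) with P_S[k, j] = p_k(s_j).
    """
    K = P_C.shape[0]
    prior = np.full(K, 1.0 / K)            # uniform start, as assumed for Theta^0
    total = M_CS.sum()
    for _ in range(n_iter):
        # E-step, Eq. (22): responsibility of class k for edge (c_i, s_j).
        joint = prior[:, None, None] * P_C[:, :, None] * P_S[:, None, :]
        norm = joint.sum(axis=0, keepdims=True)
        resp = np.divide(joint, norm, out=np.zeros_like(joint), where=norm > 0)
        # M-step, Eq. (25): count-weighted average of the responsibilities.
        new_prior = (resp * M_CS[None, :, :]).sum(axis=(1, 2)) / total
        converged = np.abs(new_prior - prior).max() < tol
        prior = new_prior
        if converged:
            break
    # Eq. (26): posterior membership of each Channel via Bayes' theorem.
    weighted = P_C.T * prior[None, :]
    row = weighted.sum(axis=1, keepdims=True)
    beta = np.divide(weighted, row, out=np.full_like(weighted, 1.0 / K), where=row > 0)
    return beta, prior

# Toy inputs: 3 Channels, 4 Services, K = 2 (hypothetical values).
M_CS = np.array([[3, 1, 0, 0],
                 [2, 2, 0, 0],
                 [0, 0, 1, 3]], dtype=float)
P_C = np.array([[0.45, 0.45, 0.10],
                [0.10, 0.10, 0.80]])
P_S = np.array([[0.4, 0.4, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])
beta, prior = estimate_beta(M_CS, P_C, P_S)
print(beta, prior)
```

The dense (K, |C|, |S|) array keeps the sketch short; for a large sparse network one would iterate only over the observed edges, in line with the $O (K | ε |)$ cost stated earlier.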

Following the estimation of parameters $Θ$ using the EM algorithm, we use the learned parameters within our hybrid probability model to represent each object $c_{i}$ by a K-dimensional vector ${\vec{s}}_{c_{i}} = (β_{i, 1}, β_{i, 2}, \dots, β_{i, K})$. Based on these representations, we calculate the centroid of each cluster by averaging the vectors ${\vec{s}}_{c}$ of all objects within that cluster:

[{\vec{s}}_{C_{k}} = \frac{\sum_{c \in C_{k}} {\vec{s}}_{c}}{| C_{k} |}]

(27)

The cardinality of the $k^{t h}$ cluster, denoted by $| C_{k} |$ , signifies the number of objects it contains. Subsequently, we calculated the distance $D (c, C_{k})$ from each object to every cluster centroid, defined as the complement of the cosine similarity between object c and cluster $C_{k}$ :

[D (c, C_{k}) = 1 - \frac{\sum_{l = 1}^{K} {\vec{s}}_{c} (l) {\vec{s}}_{C_{k}} (l)}{\sqrt{\sum_{l = 1}^{K} {({\vec{s}}_{c} (l))}^{2}} \sqrt{\sum_{l = 1}^{K} {({\vec{s}}_{C_{k}} (l))}^{2}}}]

(28)

After calculating these distances, each object c is strategically assigned to the cluster $C_{k}$ to which it is closest in terms of this distance metric. This assignment ensures that objects are grouped based on their similarity, optimizing the internal coherence of each cluster and thus enhancing the practical relevance and interpretability of our clustering outcomes.
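Equations (27) and (28) and the reassignment rule can be sketched together. The helper name `reassign`, the toy membership vectors, and the handling of empty clusters (a zero centroid, which yields distance 1 to every object) are assumptions made for illustration only.

```python
import numpy as np

def reassign(beta, labels, K):
    """One clustering-adjustment step on the membership vectors.

    beta   : (|C|, K) array whose row i is the vector s_{c_i}.
    labels : current cluster index of each object.
    Returns the new labels and the centroids of Eq. (27).
    """
    centroids = np.zeros((K, beta.shape[1]))
    for k in range(K):
        members = beta[labels == k]
        if len(members) > 0:                    # assumption: empty clusters keep a zero centroid
            centroids[k] = members.mean(axis=0)
    # Eq. (28): distance = 1 - cosine similarity between object and centroid.
    num = beta @ centroids.T
    denom = (np.linalg.norm(beta, axis=1)[:, None]
             * np.linalg.norm(centroids, axis=1)[None, :])
    sim = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    dist = 1.0 - sim
    return dist.argmin(axis=1), centroids

# Toy example: 4 objects, K = 2 (hypothetical membership vectors).
beta = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.2, 0.8],
                 [0.7, 0.3]])
labels = np.array([0, 0, 1, 1])
new_labels, centroids = reassign(beta, labels, K=2)
print(new_labels)  # the last object moves from cluster 1 to cluster 0
```

In the toy data the fourth object starts in cluster 1 but its vector points in nearly the same direction as the centroid of cluster 0, so it is reassigned.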

Results

Experimental dataset

This study employs the publicly available IFTTT (If This Then That) dataset, which captures the TAP from the popular IoT platform, for the enrichment of research in this domain.⁵³ We invite the scientific community to access and utilize this dataset, available on the official IFTTT website or through research collaborations, to replicate our findings and extend research within this field.

The dataset encompasses 397 central entities, known as Channels, which are service providers, and 1988 attribute entities representing the Triggers and Actions, collectively termed Services. These components constitute a bipartite information network of Channels and Services, in which each Channel can be linked to numerous Services, while each Service is connected to a single Channel.

Given the expansive nature of the dataset and its representation of complex IoT ecosystems, the experimental scenario focused on the strategic selection of core dataset components for detailed analysis. Recognizing the challenge presented by the sheer volume and varied data quality, approximately 80 of the most influential Channels were selected based on their centrality and prominence within the network. These selected Channels reflect a diverse array of IoT applications and are pivotal in our analysis because of their significant role in the network structure and dynamics. This subset not only depicts the typical functioning of various IoT services but also highlights the intricate interactions that form the backbone of IoT systems. This careful selection enhances the efficacy and clarity of the subsequent clustering analysis, ensuring a focus on the most impactful and informative aspects of the IoT network.

With a structure that mirrors the complex interactions endemic to IoT devices and applications, the dataset delineates the varied IoT services (397 Channels) and specific actionable conditions (1988 Services) within these frameworks. The interplays among Channels and Services are depicted as network edges, encapsulating the specific TAP rules designated by IFTTT's user base.

For our analysis, we initially imported the dataset into a Neo4j graph database⁵⁴,⁵⁵ to manage its intricate, interconnected data structure. Figure 3 illustrates the relationships between these entities as constructed in the Neo4j graph database. This system allows each Channel and Service to be represented as a node, with their interrelations defined as edges. The dataset then underwent preprocessing to validate its quality and coherence by removing duplicate and incomplete records.

Figure 3.

Schematic diagram of IFTTT data storage in Neo4j graph database.

Experimental methodology and data clustering

The objective of this study was to conduct a cluster analysis⁵⁶ of central entities. Considering the vast quantity of entities in the dataset and the varying information quality due to certain central entity connections, approximately 80 central entities of higher value were selectively integrated into the dataset to achieve more distinct clustering results. Before delving into the specifics of the I-RankClus clustering method used for this refined dataset, it is pertinent to outline the computational environment that supports our experiments (Table 3).

Table 3.

Experimental setup specifications.

Component	Specification
Computer Model	Dell XPS 15
Processor	Intel Core i7-10750H
RAM	16 GB DDR4
Storage	512 GB SSD
Operating System	Windows 10 Pro, 64-bit
Software	Python 3.8.5, Neo4j Desktop 1.4.15
Libraries	NumPy 1.19.2, Pandas 1.1.3, Scikit-learn 0.23.2

The experimental environment detailed in Table 3 facilitates the processing and analysis of the IoT data. We then utilized the I-RankClus clustering method on the refined dataset. This approach leverages a meta-path-based framework to ascertain the influence scores of nodes, thereby enabling the categorization of Channels and Services into meaningful groups. The process from raw data to final clustering outcomes is illustrated in Figure 4, providing a detailed overview of the transformation of the datasets into a structured graph, evaluation of node influence, and application of the I-RankClus algorithm. The detailed description and availability of the IFTTT dataset highlight the transparency and reproducibility of our research, fostering further advancements in IoT data analysis.

Figure 4.

Process diagram of data processing and clustering.

Evaluation of algorithm performance

The evaluation process for the I-RankClus algorithm's performance was conducted in the same computational environment as outlined in Table 3, ensuring consistency across all tests. This structured setup involved experimental trials designed to systematically compare various configurations and settings. These trials included comparisons of different meta-paths (SRS, SCS, and their combination SRS + SCS) and varying path weightings to determine their impact on clustering quality in terms of Compactness (CP) and Separation (SP).

We utilized an established IoT dataset, processed as described previously, and performed multiple runs with each configuration to ensure robustness and variability control in our findings. The effectiveness of each configuration was further examined by altering the weights assigned to individual meta-paths, providing insights into how different information prioritization affects the clustering results, thereby allowing us to optimize the algorithm for better precision and recall.

To evaluate the performance of the algorithm, we employed CP and SP as performance metrics. Compactness represents the average distance from each object within a cluster to the centroid of that cluster; the lower the CP value, the closer the objects are within a cluster, indicating a better clustering quality. The CP value can be deduced using equations (29) and (30):

[C P_{i} = \frac{1}{n_{i}} \sum_{x \in C_{i}} \| x - u_{i} \|]

(29)

[C P = \frac{1}{k} \sum_{i = 1}^{k} C P_{i}]

(30)

Here, $u_{i}$ denotes the centroid of cluster $C_{i}$ and $n_{i}$ the number of objects it contains. The SP represents the average distance between distinct cluster centroids, with higher SP values indicating greater separation between clusters and, consequently, better clustering performance. SP was calculated according to equation (31):

[S P = \frac{2}{k^{2} - k} \sum_{i = 1}^{k} \sum_{j = i + 1}^{k} \| u_{i} - u_{j} \|]

(31)
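A short sketch of how CP and SP can be computed from a set of feature vectors and cluster labels follows. It assumes the distance in equations (29)–(31) is Euclidean, which the text does not state explicitly, and it requires at least two non-empty clusters; the function name and the toy data are hypothetical.

```python
import numpy as np

def compactness_separation(X, labels):
    """Return (CP, SP) for feature vectors X and integer cluster labels.

    Assumption: distances are Euclidean.
    """
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Eq. (29): mean distance of the members of each cluster to its centroid.
    cp_i = [np.linalg.norm(X[labels == c] - centroids[n], axis=1).mean()
            for n, c in enumerate(ids)]
    cp = float(np.mean(cp_i))                       # Eq. (30)
    k = len(ids)
    # Eq. (31): average distance over all unordered pairs of centroids.
    pair_sum = sum(np.linalg.norm(centroids[i] - centroids[j])
                   for i in range(k) for j in range(i + 1, k))
    sp = 2.0 / (k * k - k) * float(pair_sum)
    return cp, sp

# Toy example: two clusters of two points each (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
cp, sp = compactness_separation(X, labels)
print(cp, sp)
```

In this toy case every point lies at distance 1 from its centroid and the two centroids are 4 apart, so CP is 1 and SP is 4.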

Comparing the performance of singular meta-paths and their composite counterparts, we set the number of clusters k to 5 and evaluated standalone meta-paths SRS and SCS, as well as their combined meta-path SRS + SCS. Table 4 presents the performance results.

Table 4.

Performance comparison of singular and composite meta-paths.

	SRS	SCS	SRS + SCS
CP	0.1245	0.1377	0.0982
SP	10.1894	9.5914	12.8817

The data in Table 4 indicate that the SRS meta-path outperforms SCS in clustering efficacy, suggesting that SRS better captures the relationship between objects and incorporates more effective path information. This performance difference implies that user behavior encapsulated by the SRS meta-path offers more pertinent insights into network structures than SCS.

Furthermore, the composite meta-path demonstrates superior performance over the singular meta-paths, achieving the lowest CP value and the highest SP value. These results attest to the ability of the composite meta-path to provide richer information and reflect network structures more effectively.

To investigate the impact of path weighting on the performance of the composite meta-path SRS + SCS, with a total path weight of 1, weights were adjusted in increments of 0.1, resulting in a variety of outcomes as detailed in Table 5, where $w_{p_{1}}$ and $w_{p_{2}}$ denote the path weights for SRS and SCS, respectively.

Table 5.

Performance comparison at various path weights for composite meta-paths.

$(w_{p_{1}}, w_{p_{2}})$	CP	SP
(0.1, 0.9)	0.1251	10.0215
(0.2, 0.8)	0.1230	10.5132
(0.3, 0.7)	0.1152	10.9154
(0.4, 0.6)	0.1047	12.0364
(0.5, 0.5)	0.0991	12.8459
(0.6, 0.4)	0.0899	13.6247
(0.7, 0.3)	0.0872	13.9566
(0.8, 0.2)	0.0951	12.0590
(0.9, 0.1)	0.1088	10.9843

The analysis in Table 5 elucidates that different path weightings distinctly affect clustering outcomes; the best performance is achieved when the weights for SRS and SCS are set to 0.7 and 0.3, respectively. This finding demonstrates that the preponderance of the SRS meta-path within the composite leads to improved clustering results. Conversely, as the weight attributed to an individual meta-path approaches 1, the composite method increasingly resembles the use of a singular meta-path, and the algorithmic performance diminishes.

To compare the accuracy of the clustering algorithms, we introduced the F-Measure metric, calculated as shown in equations (32)–(34):

[\text{precision} = \frac{n_{k}^{m}}{n_{k}}]

(32)

[\text{recall} = \frac{n_{k}^{m}}{n_{m}}]

(33)

[\text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}]

(34)
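Equations (32)–(34) can be illustrated with a small Python sketch. Several points here are assumptions, since the text does not define them: $n_{k}^{m}$ is read as the number of objects of reference class m placed in cluster k, $n_{k}$ as the size of cluster k, and $n_{m}$ as the size of class m. The aggregation into a single score (a class-size-weighted best match) is one common convention and may differ from the one used in the experiments.

```python
from collections import Counter

def pair_f_measure(n_km, n_k, n_m):
    """F-measure of Eqs. (32)-(34) for one (class m, cluster k) pair."""
    if n_km == 0:
        return 0.0
    precision = n_km / n_k
    recall = n_km / n_m
    return 2 * precision * recall / (precision + recall)

def clustering_f_measure(true_classes, cluster_labels):
    """Assumed aggregation: weight each class by its size and take its best cluster."""
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_classes, cluster_labels))   # missing pairs count as 0
    score = 0.0
    for m, n_m in class_sizes.items():
        best = max(pair_f_measure(overlap[(m, k)], n_k, n_m)
                   for k, n_k in cluster_sizes.items())
        score += n_m / n * best
    return score

# Toy example with hypothetical reference classes and cluster labels.
true_classes = ["a", "a", "a", "b", "b", "b"]
cluster_labels = [0, 0, 1, 1, 1, 1]
f_single = pair_f_measure(8, 10, 16)       # precision 0.8, recall 0.5
f_overall = clustering_f_measure(true_classes, cluster_labels)
print(f_single, f_overall)
```

Computing the score this way requires reference class labels for the Channels, so it measures agreement with an external categorization rather than internal cluster quality.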

Higher F-Measure values indicate better clustering performance. We applied the K-means, DBSCAN, and I-RankClus algorithms for cluster analysis, with I-RankClus employing the composite meta-path SRS + SCS weighted with 0.7 for SRS and 0.3 for SCS. The results are presented in Table 6.

Table 6.

Comparative performance of various clustering algorithms.

	K-means	DBSCAN	I-RankClus
F-Measure	0.6084	0.5829	0.6572
CP	0.1059	0.0854	0.0872
SP	12.0869	11.8812	13.9566

As shown in Table 6, I-RankClus outperforms both K-means and DBSCAN in terms of F-Measure and SP. Its CP value (0.0872) is slightly higher than that of DBSCAN (0.0854), meaning its clusters are marginally less compact, but it is lower than that of K-means (0.1059). Overall, these results indicate the algorithm's ability to extract effective structural information from the dataset. Further analysis of the I-RankClus results shows that the choice of meta-paths used to compute influence distinctly affects the clustering outcomes. The composite meta-path, which carries a broader array of path information, surpassed each single meta-path, and the weights assigned to the individual meta-paths also notably affected the results. Compared with the other clustering algorithms, I-RankClus processed this dataset more effectively on most metrics, which supports the validity of the algorithm.

Conclusions

This study introduced the I-RankClus algorithm designed specifically for dual-typed information networks, using integrated meta-paths to enhance the clustering process within the context of the IoT. Our key findings revealed that the integration of diverse meta-paths significantly improves precision and influence transmission in clustering efforts, outperforming conventional techniques such as K-means and DBSCAN, which often fail to fully contextualize the complexities inherent in IoT datasets.

However, this research has several limitations primarily concerning its applicability to static datasets. The dynamic and continuously updating nature of IoT data presents challenges that the current iteration of I-RankClus does not address. This underlines an immediate avenue for future research—to modify and adapt the I-RankClus framework to accommodate real-time data analysis, which is crucial for practical IoT applications.

To further support the I-RankClus algorithm as a robust tool for IoT data handling, future studies should explore the parameter sensitivity of the meta-path integration process to refine its adaptability and effectiveness. Moreover, implementing the algorithm in other IoT application areas, such as smart city infrastructures or healthcare systems, could provide additional insights and potentially open pathways to new clustering methodologies tailored to specific IoT challenges.

Footnotes

Acknowledgements

The authors would like to express their sincere gratitude to OpenAI's ChatGPT for its invaluable assistance in summarizing and refining the related work section in the Introduction of the paper (see Section 1). The insights and language enhancement provided by ChatGPT were instrumental in articulating complex ideas and ensuring clarity of expression. This support was pivotal in enhancing the overall quality of the manuscript.

Author Contribution Statement

Kuo Zhao: conceptualization, original draft preparation, review and editing of the manuscript, supervision of the research project, project administration, and funding acquisition.

Huajian Zhang: conceptualization, preparation of the original draft, manuscript review and editing, and project administration.

Jiaxin Li: provision of resources, visualization of data, data curation, and supervision of the research activities.

Qifu Pan: software, data curation, formal analysis of the study data, and investigation processes.

Li Lai: software, validation of results, formal analysis, data curation, and review and editing of the manuscript.

Yike Nie: visualization of data, development of the methodology, software programming, and data curation.

Zhongfei Zhang: supervision, project administration, and acquisition of funding for the research.

All authors have read and agreed to the published version of the manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Program of China, Basic and Applied Basic Research Foundation of Guangdong Province, 2018 Guangzhou Leading Innovation Team Program (China), 2019 Guangdong Special Support Talent Program – Innovation and Entrepreneurship Leading Team (China), (grant number 2021YFB3301702, 2023A1515011712, 201909010006, 2019BT02S593).

ORCID iD

Kuo Zhao

Author biographies

Kuo Zhao, PhD, is an Associate Professor at the School of Intelligent Science and Engineering, Jinan University. His primary research areas include blockchain, big data intelligence, cybersecurity, cloud computing, and the Internet of Things. He has published over 100 papers, including more than 30 SCI-indexed papers, and has been cited over 610 times on Google Scholar (as of March 2020). He has led several national and provincial research projects, including those funded by the National Natural Science Foundation and key projects in Jilin Province. As a project leader, he has received a Second Prize in the Jilin Provincial Science and Technology Progress Award. As a key contributor, he has received a First Prize in the Ministry of Education's Science and Technology Progress Award, a First Prize in the Jilin Provincial Science and Technology Progress Award, and a Second Prize in the Seventh National Teaching Achievement Award. He was selected as a Top Innovative Talent in the third batch of Jilin Province and a Leading Talent in the Jilin Province Young and Middle-aged Scientific and Technological Innovation Team. In September 2019, he was invited by the Third Section of the General Office of the Central Committee of the Communist Party of China to submit policy recommendations on “From Blockchain to Distributed Ledger - Challenges and Opportunities.”

Huajian Zhang is an undergraduate student at the School of Intelligent Systems Science and Engineering, Jinan University. His research interests include artificial intelligence, deep learning, big data, and the Internet of Things. He has been involved in four major innovative projects.

Jiaxin Li, born in 2002, is an undergraduate student at the School of Intelligent Systems Science and Engineering, Jinan University. His main research interests include machine learning and computer vision.

Qifu Pan, born in 1998, is a master's degree candidate at the School of Intelligent Systems Science and Engineering, Jinan University. His main research directions include deep learning, machine learning, and large language models.

Li Lai is currently a graduate student majoring in artificial intelligence at Jinan University, Guangdong, China, since 2022. He obtained his bachelor's degree in information management from Jiangxi University of Finance and Economics. His primary research interests include privacy computing and large language models, focusing on the integration of federated learning and blockchain, as well as the theoretical and technical aspects of large language model applications.

Yike Nie is a graduate student at the School of Intelligent Systems Science and Engineering, Jinan University. His main research direction is knowledge graphs. He has participated as a project member in the National Key Research and Development Program and has won the third prize twice in provincial programming competitions.

Zhongfei Zhang, PhD, is affiliated with the School of Management, Jinan University. His publications include many articles in the journals such as Advanced Engineering Informatics, International Journal of Production Research, IET Collaborative Intelligent Manufacturing, Science Progress, Journal of Mechanical Engineering, etc. His research directions include smart manufacturing system management, production logistics synchronized control, and social manufacturing. His projects include: Guangdong Basic and Applied Basic Research Fund Project: Research on Credible Intelligent Synchronized Decision-Making Mechanism and Multi-Objective Optimization Method for “Production-Transportation-Inventory” in Industrial Parks (Project No. 2023A1515011712); National Natural Science Foundation of China (NSFC) Project: Business Meta-space Driven Hyper-cyclic Optimal-state Evolution Method for Distributed Synchronized Manufacturing System (DSMS) (Project No. 52375498); and The Fourth Batch of Xijiang Innovation Team Project in Zhaoqing: Intelligent Rechargeable Stereo Parking Garage Based on the Internet of Things, 2022.1-2024.12.

References

1. Qadir, Saeed, et al. Towards 6G Internet of Things: recent advances, use cases, and open challenges. ICT Express 2023; 9: 296–312.

2. Ashton. That ‘internet of things’ thing. RFID J 2009; 22: 97–114.

3. Baras, Brito. Introduction to the internet of things. In: Internet of things. Boca Raton, FL, USA: Chapman and Hall/CRC, 2017, pp. 3–32.

4. Corno, De Russis, Monge Roffarello. How do end-users program the Internet of Things? Behav Inf Technol 2022; 41: 1865–1887.

5. Qian, Zhang, et al. An empirical characterization of IFTTT: ecosystem, usage, and performance. Association for Computing Machinery, 2017, pp. 398–404.

6. Abdou, Ezz, Farag. Digital automation platforms comparative study. IEEE 2021: 279–286.

7. Chen, Zhang, Elliot, et al. Fix the leaking tap: a survey of Trigger-Action Programming (TAP) security issues, detection techniques and solutions. Comput Secur 2022; 120: 102812.

8. Sun, Han, Yan, et al. Heterogeneous information networks: the past, the present, and the future. Proc VLDB Endow 2022; 15: 3807–3811.

9. Chun-bo, Ji-wen. Review of recommendation based on heterogeneous information network. Comput Eng Sci 2023; 45: 2047.

10. Jin, Qin, Fang, et al. An efficient neighborhood-based interaction model for recommendation on heterogeneous graph. 2020: 75–84.

11. Forouzandeh, Berahmand, Sheikhpour, et al. A new method for recommendation based on embedding spectral clustering in heterogeneous networks (RESCHet). Expert Syst Appl 2023; 231: 120699.

12. Zhang, Gong, Huang, et al. Clustering heterogeneous information network by joint graph embedding and nonnegative matrix factorization. ACM Trans Knowl Discov Data 2021; 15: 1–25.

13. Zhou, Bousquet, Lal, et al. Learning with local and global consistency. Adv Neural Inf Process Syst 2003; 16.

14. Liu, Han. Spectral clustering. In: Data clustering. Boca Raton, FL: Chapman and Hall/CRC, 2018, pp. 177–200.

15. Madhamsetty. Approximate N-Clustering on Heterogeneous Information Networks with Star Schema. University of Cincinnati, 2023.

16. Ester, Kriegel H-P, Sander, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. 1996: 226–231.

17. Aghdam, Zanjani. A novel regularized asymmetric non-negative matrix factorization for text clustering. Inf Process Manag 2021; 58: 102694.

18.

Girvan

Newman

. Community structure in social and biological networks. Proc Natl Acad Sci U S A 2002; 99: 7821–7826.

19.

Newman

. Fast algorithm for detecting community structure in networks. Phys Rev E 2004; 69: 066133.

20.

Sun

Han

Zhao

, et al. RankClus: integrating clustering with ranking for heterogeneous information network analysis. 2009:565–576.

21.

Yamazaki

Sato

Shiokawa

, et al. Fast and parallel ranking-based clustering for heterogeneous graphs. J Data Intell 2020; 1: 137–158.

22.

Yang

Han

. Revisiting citation prediction with cluster-aware text-enhanced heterogeneous graph neural networks. 2023.

23.

Sun

Han

. Ranking-based clustering of heterogeneous information networks with star network schema. Association for Computing Machinery, 2009; vol. 1, pp. 797–806.

24.

Gupta

Aggarwal

Han

, et al. Evolutionary clustering and analysis of heterogeneous information networks. IBM Res Rep 2010: 1006–1064.

25.

Nie

Zhang

Wen

J-R

, et al. Object-level ranking: bringing order to web objects. 2005:567–574.

26.

Umer

. A framework for dynamic heterogeneous information networks change discovery based on knowledge engineering and data mining methods. University of Salford (United Kingdom); 2021.

27.

Swar

Khoriba

Belal

. A unified ontology-based data integration approach for the internet of things. Int J Electr Comput Eng 2022; 12: 2097.

28.

Shi

Pan

Jiang

, et al. An ontology-based methodology to establish city information model of digital twin city by merging BIM, GIS and IoT. Adv Eng Inf 2023; 57: 102114.

29.

Ganzha

Paprzycki

Pawłowski

, et al. Semantic interoperability in the Internet of Things: an overview from the INTER-IoT perspective. J Netw Comput Appl 2017; 81: 111–124.

30.

Chen

Zhou

Zheng

, et al. Time-aware smart object recommendation in social internet of things. IEEE Internet Things J 2019; 7: 2014–2027.

31.

Noura

Gyrard

Heil

, et al. Automatic knowledge extraction to build semantic web of things applications. IEEE Internet Things J 2019; 6: 8447–8454.

32.

Kumar Shakya

Sundar

Kushwaha

, et al. Internet of things-based intelligent ontology model for safety purpose using wireless networks. Wirel Commun Mob Comput 2022; 2022: 8.

33.

Elgazzar

Khalil

Alghamdi

, et al. Revisiting the internet of things: new trends, opportunities and grand challenges. Front Internet Things 2022; 1: 1073780.

34.

Zhuang

Huang

Liu

. Integrating sensor ontologies with niching multi-objective particle swarm optimization algorithm. Sensors 2023; 23: 5069.

35.

Liu

Chen

Shin

, et al. Latent attention for if-then program synthesis. Adv Neural Inf Process Syst 2016; 29.

36.

Thuluva

Bröring

Medagoda

, et al. Recipes for IoT applications. 2017:1–8.

37.

Corno

De Russis

Roffarello

. A semantic web approach to simplifying trigger-action programming in the IoT. Computer (Long Beach Calif) 2017; 50: 18–24.

38.

Jiang

Zhang

, et al. TapChain: a rule chain recognition model based on multiple features. Secur Commun Netw 2021; 2021: 1–11.

39.

El-Kishky

Markovich

Park

, et al. Twhin: Embedding the twitter heterogeneous information network for personalized recommendation. 2022:2842–2850.

40.

Zhao

B-W

You

Z-H

, et al. HINGRL: predicting drug–disease associations with graph representation learning on heterogeneous information networks. Brief Bioinform 2022; 23: bbab515.

41.

Liu

. A node clustering algorithm for heterogeneous information networks based on node embeddings. Multimed Tools Appl 2024; 83: 3745–3766.

42.

Forouzandeh

Rostami

Berahmand

, et al. Health-aware food recommendation system with dual attention in heterogeneous graphs. Comput Biol Med 2024; 169: 107882.

43.

Ammar

Inoubli

Zghal

, et al. Systematic literature review on Heterogeneous Information Networks. 2023.

44.

Sun

Han

. Meta-path-based search and mining in heterogeneous information networks. Tsinghua Sci Technol 2013; 18: 329–338.

45.

Rogers

. The Google Pagerank algorithm and how it works. 2002.

46.

Shang

Liu

, et al. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:161009769 2016.

47.

Wang

Shi

, et al. Dynamic heterogeneous information network embedding with meta-path based proximity. IEEE Trans Knowl Data Eng 2020; 34: 1117–1132.

48.

Hoy

. If this then that: an introduction to automated task services. Med Ref Serv Q 2015; 34: 98–103.

49.

Rahmati

Fernandes

Jung

, et al. IFTTT vs. Zapier: A comparative study of trigger-action programming frameworks. arXiv preprint arXiv:170902788 2017.

50.

Chang

Xue

, et al. Automatic channel pruning via clustering and swarm intelligence optimization for CNN. Appl Intell 2022; 52: 17751–17771.

51.

Zheng

Zhou

. Exploiting chain rule and Bayes’ theorem to compare probability distributions. Adv Neural Inf Process Syst 2021; 34: 14993–15006.

52.

Bertsekas

. Constrained optimization and Lagrange multiplier methods. Cambridge, MA, USA: Academic Press, 2014.

53.

Haoxiang Yu

Julien

. Data from: Dataset: Analysis of IFTTT Recipes to Study How Humans Use Internet-of-Things (IoT) Devices. 2021. doi:https://doi.org/10.5281/zenodo.5572860

54.

Jabde

. Learning Graph Databases: Neo4j an overview.

55.

Scifo

. Graph Data Science with Neo4j. Birmingham, UK: Packt Publishing, 2023.

56.

Ezugwu

Ikotun

Oyelade

, et al. A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng Appl Artif Intell 2022; 110: 104743.