Network Analysis Tutorial - Run Analysis

Created by Steve Hoover, Modified on Wed, Aug 14, 2024 at 2:00 PM by Steve Hoover

Each model includes a sample data set (OfficeStar) that can be found under the Tutorials dropdown on the Enginius Dashboard.

The remainder of the tutorials use the Network OfficeStar data set as the starting data set.

Network Run Analysis Settings (OfficeStar Default Settings)

Three distinct aspects can be specified for the analysis: network topology, node clustering, and the network diffusion process.

Network topology

From Enginius library: Enginius includes several data sets that help users understand different types of network structures.
- Airport data: will show the "hub and spoke" structure that we intuitively know about from the nature of flight routes in the US.
- Facebook data: will reveal three large clusters, and a few smaller ones, which is typical of the types of connections we see in social networks. People belong to close communities consisting of family, friends, interest groups, and school and workplace contacts, where people within a community are all likely to know each other. At the same time, people also have some connections to distant acquaintances, who themselves belong to their own communities.
- Xoxoday data: based on links that occur within the Twitter network. Tweets and retweets connect people, and there are also connections between the tweets themselves (e.g., a reply to a tweet). These networks also show the presence of hubs, which are either the company's Twitter handle, one of the tweets, or a prominent user in the network. These aspects are described in the xoxoday case, which can be obtained from Ivey Publishing or from Harvard Business Publishing.
From user data: select the data block that contains your network

Node clustering

The second component of the dialog box is used for extracting clusters (community structures) within a network. A key characteristic of social networks is that they exhibit clustering. In other words, people tend to group together in clusters such as family, friends, Facebook friends, colleagues, tennis partners, etc., where the nodes within a cluster have denser links with each other than they have with nodes outside the cluster. Detecting such clusters or communities can be important for understanding how various influence processes propagate through the network. In terms of analytics, community detection involves reorganizing the nodes and links in the network to help us visualize its community structure (e.g., whether there is one large community and several small communities, or whether there are several moderately sized communities).

Community detection is a computationally hard task (many formulations are NP-hard), especially in large networks, and there are many available methods. Enginius offers two types of clustering:
- A "greedy heuristic" method that maximizes a modularity metric. Modularity is defined as the difference between the observed share of in-cluster links under a proposed clustering scheme and the expected share of in-cluster links under the same scheme in a randomly generated equivalent network.
- "Hierarchical clustering," where nodes are grouped together based on the similarity of their profile of ties to other nodes. Starting with every node in its own cluster, the algorithm iteratively merges the currently closest nodes or clusters, based on a "distance metric" that specifies the closeness of nodes.
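To make the modularity idea concrete, here is a minimal Python sketch. It is not the Enginius implementation: the function names, the toy edge list, and the simple "merge the best pair until nothing improves" loop are assumptions chosen for illustration, and the code assumes an undirected, unweighted network without self-loops.

```python
from itertools import combinations

def modularity(edges, communities):
    """Observed share of in-cluster links minus the share expected at random."""
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    inside = [0] * len(communities)      # links with both ends in the cluster
    degree_sum = [0] * len(communities)  # total degree of the cluster's nodes
    for u, v in edges:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(inside, degree_sum))

def greedy_communities(edges):
    """Start with one cluster per node; keep applying the best merge."""
    communities = [{n} for n in sorted({n for e in edges for n in e})]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(range(len(communities)), 2):
            merged = [c for i, c in enumerate(communities) if i not in (a, b)]
            merged.append(communities[a] | communities[b])
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda item: item[0])
        if q <= best_q + 1e-12:  # no merge improves modularity: stop
            break
        best_q, communities = q, merged
    return communities, best_q

# Toy network (hypothetical): two triangles joined by a single bridge link.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
found, score = greedy_communities(EDGES)
print(sorted(sorted(c) for c in found), round(score, 4))
```

On this toy network the loop recovers the two triangles as separate communities, while putting all six nodes into one cluster gives a modularity of zero. Recomputing modularity for every candidate merge is slow on large networks; it is done here only to keep the sketch short.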

There are three distance metric options for assessing how close two nodes in the network are to each other:

- Cosine similarity: values range from 0 to 1, where 1 means the two nodes are perfectly similar in that they have the same neighboring nodes, and 0 means they have no common neighbors.
- Euclidean distance: a metric based on the sum of the number of neighbors of each node minus two times the number of their common neighbors. It equals 0 if the two nodes have the same neighbors, and it is highest when the two nodes have no neighbors in common. This metric therefore measures dissimilarity rather than similarity, and it should be normalized to fall in the range 0 to 1, where 1 indicates perfect similarity.
- Pearson correlation coefficient: a metric based on the number of common neighbors between two nodes compared with the expected number of common neighbors in an equivalent network where the nodes are connected randomly. Its value falls in the range -1 to 1, where -1 indicates perfect dissimilarity and 1 indicates perfect similarity.
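The three metrics can be sketched in a few lines of Python using neighbor sets. This is an illustration under stated assumptions rather than the formulas Enginius uses internally: the function names and the toy network are hypothetical, the Euclidean version takes the square root of the count described above, the Pearson version treats each node's row of the 0/1 adjacency matrix as a variable, and undefined cases (a node with no neighbors) return 0 by assumption. The 0-to-1 normalization of the Euclidean metric is not shown because the text does not specify it.

```python
import math

def cosine_similarity(neighbors, i, j):
    """Common neighbors scaled by the geometric mean of the two degrees."""
    ki, kj = len(neighbors[i]), len(neighbors[j])
    if ki == 0 or kj == 0:
        return 0.0  # assumption: undefined case treated as no similarity
    return len(neighbors[i] & neighbors[j]) / math.sqrt(ki * kj)

def euclidean_distance(neighbors, i, j):
    """Square root of (degree_i + degree_j - 2 * common neighbors)."""
    common = len(neighbors[i] & neighbors[j])
    return math.sqrt(len(neighbors[i]) + len(neighbors[j]) - 2 * common)

def pearson_correlation(neighbors, i, j):
    """Common neighbors versus the count expected under random wiring."""
    n = len(neighbors)
    ki, kj = len(neighbors[i]), len(neighbors[j])
    common = len(neighbors[i] & neighbors[j])
    denom = math.sqrt((n * ki - ki ** 2) * (n * kj - kj ** 2))
    if denom == 0:
        return 0.0  # assumption: undefined case treated as uncorrelated
    return (n * common - ki * kj) / denom

# Toy network (hypothetical): A and B share both neighbors; E shares none.
NEIGHBORS = {
    "A": {"C", "D"}, "B": {"C", "D"},
    "C": {"A", "B"}, "D": {"A", "B"},
    "E": {"F"}, "F": {"E"},
}
for pair in [("A", "B"), ("A", "E")]:
    print(pair,
          round(cosine_similarity(NEIGHBORS, *pair), 3),
          round(euclidean_distance(NEIGHBORS, *pair), 3),
          round(pearson_correlation(NEIGHBORS, *pair), 3))
```

Nodes A and B, which have identical neighbors, score 1 on cosine similarity and Pearson correlation and 0 on Euclidean distance. Nodes A and E, which share no neighbors, score 0 on cosine similarity and get a negative Pearson correlation.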

Network diffusion process

The third component of the dialog box is used for exploring how an influence process propagates within the specified social network. With this analysis option, you can simulate the diffusion process (e.g., spread of influence, spread of new product adoptions) through the social network. You need to specify the nature and amount of seeding, namely, identifying the influencers (seeds) in the network where the process will be initiated, say by the influencers sending messages to their followers. You also need to specify the nature of the influence process, namely the diffusion model. The choices you make here require that you have a good working knowledge of the concepts and theories related to influencer marketing and diffusion models. These options are explained next.

Simulate network diffusion: checking this box opens up the options for network diffusion.
- Manual seeding: Specify the exact set of nodes where the diffusion process will be initiated.
- Automatic seeding: There are four options for selecting the seeds: (1) selecting nodes randomly from the set of all nodes in the network; (2) selecting nodes that have the highest degree centrality (see below), which is equivalent to selecting the influencers who have the most followers; (3) selecting nodes that have the highest closeness centrality (see below); and (4) selecting nodes that have the highest betweenness centrality (see below). It is typically useful to first run the analysis without diffusion to identify the influencers who are likely to work best for a specific application.
  - Percentage seeded (%): Specify the % of the network that will be selected for seeding (usually this will be a small number such as 1%).
- Diffusion model: Two options are available. (1) The influence process is structured as in the Bass diffusion model (see the Bass Forecasting module in Enginius), mimicking how a disease propagates through a population. (2) The threshold diffusion model attempts to explain how crowd behaviors evolve and is based on the notion that influence propagates through a node only when the degree of influence experienced by that node exceeds a node-specific threshold. In both processes, influence spreads only among nodes that are directly connected to each other. Both processes are specified by two parameters, p and q, and both accommodate the possibility of dropout (equivalent to someone dying or getting vaccinated, and thereby no longer propagating a disease).
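The interplay between seeding and a threshold-style process can be sketched as follows. This is a simplified illustration, not the Enginius model: it omits the p and q parameters and dropout (whose exact equations are not given here), and the function names, the toy network, and the rule "a node adopts once the share of its adopted neighbors reaches its threshold" are all assumptions.

```python
def top_degree_seeds(neighbors, percent_seeded):
    """Pick the given percentage of nodes, highest degree first (at least one)."""
    count = max(1, round(len(neighbors) * percent_seeded / 100))
    ranked = sorted(neighbors, key=lambda n: (-len(neighbors[n]), n))
    return set(ranked[:count])

def threshold_cascade(neighbors, seeds, thresholds):
    """Run synchronous rounds until no further node adopts.

    Returns the final adopter set and the adopter count after each round.
    """
    adopted = set(seeds)
    history = [len(adopted)]
    while True:
        newly = {
            n for n in neighbors
            if n not in adopted and neighbors[n]
            and len(neighbors[n] & adopted) / len(neighbors[n]) >= thresholds[n]
        }
        if not newly:
            return adopted, history
        adopted |= newly
        history.append(len(adopted))

# Toy network (hypothetical): hub 0 with four spokes, plus a tail 4-5-6.
NETWORK = {
    0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0},
    4: {0, 5}, 5: {4, 6}, 6: {5},
}
THRESHOLDS = {n: 0.5 for n in NETWORK}  # node-specific in general; equal here

seeds = top_degree_seeds(NETWORK, percent_seeded=10)
adopted, history = threshold_cascade(NETWORK, seeds, THRESHOLDS)
print(seeds, history)

# Starting from the peripheral node 6 instead, the cascade stalls at the hub.
_, stalled = threshold_cascade(NETWORK, {6}, THRESHOLDS)
print(stalled)
```

In this toy case, seeding the highest-degree node lets adoption reach the whole network, while seeding a peripheral node stalls once the cascade reaches the hub, whose threshold is not met by a single adopting neighbor. This is the kind of difference that comparing seeding options is meant to reveal.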

Your Enginius report can be generated in many different formats. Clicking the globe beside the Run button will allow you to select a new report format.

After selecting the desired model options, click Run to generate the report in the desired format.