Entry G5: Projecting Bimodal to Unimodal

9 minute read

I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing speed, and index optimization for the ~50 different graph metrics and algorithms I’ve explored for machine learning features.

As mentioned in Entry G2 I’m using the Marvel Universe Social Network as a jumping in point. This data is handy because:

  • It’s public
  • It’s easy to load into a Neo4j graph database
  • It easily fits either a bimodal or unimodal structure
  • It’s small enough that I can store multiple versions of it on a laptop

Database VersioningPermalink

The bimodal graph structure is the one that I work with for my job (technically it’s multimodal, but I tend to think of it as bimodal). During the course of my work, I’ve started to wonder several things:

  • Is it necessary to project a bimodal graph to a unimodal graph to run the graph algorithms?
  • Is it easier to use a projected unimodal graph when looking to engineer features for machine learning?
  • Is it faster to use a bimodal or unimodal structure (historically I have trouble with timeout errors)?
  • Can a unimodal and bimodal version of the graph exist in the same space and still be usable?

In order to really explore the ramifications of the different graph models (see Entry G3 for more on graph modeling), I created three versions of the Marvel Universe Social Network. The three versions are:

  1. Bimodal
  2. Weighted projected unimodal
  3. Mixed bimodal and projected unimodal

ContextPermalink

Why create three different versions? Why not just use the mixed graph and have it all? There are a couple reasons.

First, I frequently get a mismatch between what I expect and the actual results. As such, I tend to test things from multiple angles before trusting my results.

During these tests, I’ve found that not all functions do what I think they will. For example, I ran the gds.alpha.degree.stream() function from the Graph Data Science Library. The Degree Centrality doc page states “Degree centrality measures the number of incoming and outgoing relationships from a node.”

I used the count() function (which allowed me to easily hone in on different relationship populations: hero-to-comic or hero-to-hero or the combination of both - correct results below) to double check the results and found that the number of relationships wasn’t accurately reflected in the numbers the function returned.

Side note, if you don’t know Cypher and don’t understand the queries or syntax, don’t worry too much. I’ll go into more detail once we get into the metrics and running actual queries. If you want or need to know right now, check out the Introduction, Syntax, Clauses, and Functions sections of the Neo4j Cypher Manual.

Hero-to-hero degree countPermalink

MATCH (h:Hero)-[r]-(o:Hero)
RETURN h.name, count(r) as h_degree
Order by h_degree desc

Top 5 results:

Hero Degree Count
"CAPTAIN AMERICA" 1919
"SPIDER-MAN/PETER PARKER" 1754
"IRON MAN/TONY STARK" 1566
"THING/BENJAMIN J. GR" 1448
"MR. FANTASTIC/REED R" 1416

Hero-to-comic degree countPermalink

MATCH (h:Hero)-[r]-(o:Comic)
RETURN h.name, count(r) as h_degree
Order by h_degree desc

Top 5 results:

Hero Degree Count
"SPIDER-MAN/PETER PARKER" 1577
"CAPTAIN AMERICA" 1334
"IRON MAN/TONY STARK" 1150
"THING/BENJAMIN J. GR" 963
"THOR/DR. DONALD BLAK" 956

Hero-to-all degree countPermalink

MATCH (h:Hero)-[r]-(o)
RETURN h.name, count(r) as h_degree
Order by h_degree desc

Top 5 results:

Hero Degree Count
"SPIDER-MAN/PETER PARKER" 3331
"CAPTAIN AMERICA" 3253
"IRON MAN/TONY STARK" 2716
"THING/BENJAMIN J. GR" 2411
"HUMAN TORCH/JOHNNY S" 2298

Second, I want multiple versions of the graph to see if it’s easier and faster to structure it one way vs another. One of the major road blocks I’ve encountered at work is that it takes forever to run some of the queries and algorithms I want. These problem queries end up timing out or temporarily crashing the graph. As would be expected, our developers and software engineers get rather testy when that happens.

Unimodal ProjectionPermalink

ContextPermalink

The purpose of projecting a bimodal graph to a unimodal structure is to directly connect nodes of interest.

As a reminder from Entry G3, here’s what it looks like when we take a bimodal graph and project it to a unimodal graph:

Bimodal GraphPermalink

Projected Unimodal GraphPermalink

Cliques CaveatPermalink

One thing to keep in mind when projecting a bimodal network into a unimodal structure is that you’ll get a lot of cliques.

Clique: a subset of nodes where every distinct node is connected to every other distinct node.

There are four cliques from our example above.

To make this concept clearer, let’s look at an example from a single comic.

I pulled the comic W2 50 and the heroes in it. Some statistics about this subgraph:

  • There is 1 comic
  • There are 9 heroes
  • Total of 10 nodes
  • There are 9 relationships

When we project this to a unimodal structure, the number of relationships multiples significantly. The statistics for the projected version:

  • There are 0 comics (we removed it so we could connect heroes directly)
  • There are 9 heroes
  • Total of 9 nodes
  • There are 36 relationships

Another caveat to this method is that projecting the relationships makes certain assumptions about the connectivity of the other nodes. For example, in the W2 50 comic example, Silver Fox may not have actually met some of the other characters. Wikipedia says she is the former love interest for Wolverine. The only X-man she interacted with could have been Wolverine, but in projecting the bimodal graph to a unimodal graph we are making the assumption that all nodes (heroes) interacted with each other because they were in the same comics.

I’m not quite sure what the repercussions of these caveats are for the metrics I want and the populations I’m trying to find, but that’s one of the questions I’m trying to answer in this series of entries.

Create the Mixed ModelPermalink

The raw data for the Marvel Universe Social Network is essentially stored as a bimodal graph, so that was my starting point. If you remember, we loaded the Marvel data into a bimodal graph in Entry G2.

Now, you could go through all the steps in Entry G2 two more times to create the base graph, or you could simply clone what you already did.

1. Clone the bimodal graphPermalink

Click on the dots in the upper right corner of the database on the My Project page and choose Clone.

Caution: When you click Clone it looks like nothing is happening. Give it a minute and a new database named DBMS should appear in your list of databases.

2. Rename the databasePermalink

The default database name is DBMS. To change it simply select the database, which will bring up the Details panel. Click the pencil icon near where it says DBMS and type whatever name you want (unsurprisingly, I named mine “Marvel Universe Mixed”).

3. Add weighted edges to the graphPermalink

Start the newly cloned graph and enter the following code into the Neo4j command line:

Call apoc.periodic.iterate('MATCH (h1:Hero)-->(:Comic)<--(h2:Hero) where id(h1) < id(h2) RETURN h1, h2',
'MERGE (h1)-[r:KNOWS]-(h2) on CREATE SET r.weight = 1 on MATCH SET r.weight = r.weight+1', {batchSize:5000, parallel:false, iterateList:True});

4. Admire your new mixed model graphPermalink

If you want to take it for a test spin, go to the section below titled “Code for Bimodal/Unimodal Examples” (below the “Resources” section) and try out the code to find the subsets I used for the examples.

Create the Projected Unimodal ModelPermalink

1. Clone the mixed graphPermalink

Since the mixed model graph has both bimodal and unimodal nodes and relationships, we can just start from there.

Just like before, click on the dots in the upper right corner of the appropriate database (this time the mixed graph instead of the bimodal graph) on the My Project page and choose Clone.

2. Rename the databasePermalink

The default database name will still be DBMS. To change it select the database, which will bring up the Details panel. Click the pencil icon near where it says DBMS and type whatever name you want (unsurprisingly, I named mine “Marvel Universe Unimodal”).

3. Remove bimodal relationshipsPermalink

To create a truly unimodal graph, we need to remove the comics and all the relationships that connect to the comic nodes.

Start the newly cloned graph (make sure your other graphs are stopped or the new one won’t start). Then enter the following code into the Neo4j command line:

MATCH (c:Comic)
DETACH DELETE c;

The DETACH DELETE command conveniently deletes the selected nodes (in this query, everything with the Comic label) and all the relationships connected to those nodes.

4. Admire your new unimodal model graphPermalink

Next UpPermalink

Analysis Metrics

ResourcesPermalink

Code for Bimodal/Unimodal ExamplesPermalink

If you want to play with this data yourself, here are the steps I followed to get the subsets and send them to Gephi.

Note, when locating the subsets I used the mixed model which has both the hero-to-hero connections AND the hero-to-comic connections.

Dark CrawlerPermalink

  1. Find a small subset

To find a small subgraph for the first example, I did a degree count (in this case the degree count will return the number of comics that the hero appears in according to our dataset), then picked a name from the list.

The query has the condition that the hero must be in more than 4 comics. From the results, I picked “DARK CRAWLER” because it was close to the top and it looked vaguely familiar and was easy to spell (I wasn’t really interested in typing “YASHIDA, MARIKO MU” every time I wanted to run a query).
MATCH (h:Hero)-[r]-(o:Comic)
WITH h.name as h_name, count(r) as h_degree
WHERE h_degree > 4
RETURN h_name, h_degree
Order by h_degree
  1. Send Hero Dark Crawler’s hero connections to Gephi
MATCH path = (h1:Hero {name: 'DARK CRAWLER'})-[:KNOWS]-(h2)
CALL apoc.gephi.add(null, 'workspace1',path,'weight') yield nodes
RETURN *
  1. Send Hero Dark Crawler’s comic-hero connections to Gephi
MATCH path = (h1:Hero {name: 'DARK CRAWLER'})-[:APPEARS_IN]-(c:Comic)-[:APPEARS_IN]-(h2:Hero)
CALL apoc.gephi.add(null, 'workspace2', path) yield nodes
RETURN *

W2 50Permalink

  1. Find a small subset

For the second example, I wanted a comic with a small number of connections but enough to make the cliques obvious, so I counted all relationships of the heroes, then scrolled to the end of the list. “SHIVA” had nine connections, which seemed like it would fit the bill (yes, there was some trial and error in choosing a hero to use).

MATCH (h:Hero)-[r]-(o)
RETURN h.name, count(r) as h_degree
Order by h_degree
  1. Send Hero Shiva’s hero connections to Gephi
MATCH path = (h1:Hero {name: 'SHIVA'})-[:KNOWS]-(h2)
CALL apoc.gephi.add(null, 'workspace1',path,'weight') yield nodes
RETURN *
  1. Send Comic connections to Gephi

This query results in the same subset as step 2, but starts from the comic name instead of the hero.

MATCH path = (c:Comic {name: 'W2 50'})-[:APPEARS_IN]-(h:Hero)
CALL apoc.gephi.add(null, 'workspace2', path) yield nodes
RETURN *