Entry G9: Measuring Performance
As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors … or crashed the cluster, which made me a persona non grata with the developer running the database.
Now that I’m trying different solutions to address these runtime problems, I need a way to compare performance of the different graph models, as well as assess the pros and cons of one over another.
The notebooks with the code for this entry can be found on my GitHub page: the Entry G9: raw notebook (i.e., the one with all my failed attempts to access the metadata) and the Entry G9: Global Measures notebook (i.e., the one with actual working code).
The Problem
As discussed in Entry G5, I want to compare the memory use, processing speed, and index optimization for the different iterations of graph model I come up with.
Many blog posts point out that the retrieval time is listed at the bottom of the Neo4j Browser query results.
This is great if I want to manually capture that information one query at a time. However, seeing that I have ~50 metrics and at least 3 graph models to run (not to mention iterating through different index optimizations), I really don’t want to capture that information manually.
Additionally, as referenced in Query Tuning in the Neo4j Cypher Manual, Neo4j caches query plans to help improve performance. This means that the first run of a query will be slower because the plan isn’t cached. For this reason, as well as other causes of variation in run time, I want to run a query multiple times and capture the metrics for each run. This will allow me to see the differences in run times and memory use.
The Options
I found two types of solutions to this problem:
- query plans
- performance monitoring
Query Plans
`EXPLAIN` and `PROFILE` both return information about the query plan.
- `EXPLAIN`: returns the plan and expected results without actually running the query
- `PROFILE`: actually runs the query, allowing it to return the query plan, actual run times, and the results of the query
The information for both `EXPLAIN` and `PROFILE` can be accessed using the `.consume()` method on the `Result` object.
Monitoring
The other two options I found involve query logging or using the Halin Neo4j Monitoring app.
Both of these options seem more suited to production monitoring of the performance and health of the database. Halin gives real-time results on the database, but tying that information to any one query would be tricky and involve manual monitoring. What I’m trying to determine is the performance of the queries themselves.
The query logging option also appears to only be available in Enterprise Edition. That’s not a problem for my work dataset, but I’m using the Community Edition to test my theories on toy datasets before moving back to the large work dataset. So the logging solution is out for these first tests.
The Proposed Solution
After spending a significant amount of time with the Python Driver Result documentation, the blog post Neo4j Python 4.0 driver— the latest driver for the next gen database, and finally giving up and talking to my software engineering friend Mohit, I was able to capture the `EXPLAIN` and `PROFILE` information for any given query.
Calling `.consume()` on a `Result` returns a summary object with a wealth of information (see the sketch after this list), including:
- the time it took for the server to have the result available
- the time it took for the server to consume the result
- database hits
- global memory use
- number of rows returned
- time
- the same information as above but for each child step to achieve the final result
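Here is a minimal sketch of how that capture might look with the 4.x Python driver. The connection details and query are placeholders, and the exact keys inside the profile dictionary should be treated as assumptions that can vary by query and server version:

```python
from neo4j import GraphDatabase

# Placeholder connection details for illustration only
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = "PROFILE MATCH (c) RETURN count(c) AS node_count"

with driver.session() as session:
    result = session.run(query)
    summary = result.consume()  # returns a ResultSummary object

    # Server-side timings, in milliseconds
    print("Result available after (ms):", summary.result_available_after)
    print("Result consumed after (ms):", summary.result_consumed_after)

    # The profiled plan; only populated when the query is prefixed with PROFILE
    profile = summary.profile or {}
    print("Operator:", profile.get("operatorType"))
    print("Db hits:", profile.get("dbHits"))
    print("Rows:", profile.get("rows"))
    print("Child steps:", len(profile.get("children", [])))

driver.close()
```

If the same query is prefixed with `EXPLAIN` instead, the plan shows up on `summary.plan` and the query isn’t actually executed.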
The Fail
I haven’t needed this header for a while. But…
When pulling back the metadata on the query Result objects, I found that some queries have more information and some have less. Of the data points I’m interested in, not all of them are available for all queries. Specifically, queries like `call apoc.meta.stats()` were missing some data points I was trying to pull.
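One way to deal with that is to read every field defensively and let anything a query doesn’t report come back as None. This is just a sketch; the helper name and the set of fields are my own placeholders, not a settled list:

```python
def summary_fields(summary):
    """Collect whichever top-level metrics are present on a ResultSummary.

    Fields a particular query doesn't report simply come back as None.
    """
    profile = summary.profile or {}  # None for queries run without PROFILE
    return {
        "available_ms": summary.result_available_after,
        "consumed_ms": summary.result_consumed_after,
        "db_hits": profile.get("dbHits"),
        "rows": profile.get("rows"),
    }
```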
The complexity of a query also determines how many child steps it has. A query like `MATCH (c) RETURN count(c) as node_count` has one child entry.
This query has nine child levels:

```cypher
MATCH (n) WHERE NOT (n)--()
WITH count(distinct n) as isolates_count
MATCH ()-[r]->()
WITH count(r) as relation_ct, isolates_count
MATCH (c)
WITH count(distinct c) as node_count, isolates_count, relation_ct
RETURN node_count, relation_ct, isolates_count, round(toFloat(isolates_count)/node_count*10000) / 100 as isolates_pct
```
If I want information on the child steps, I’ll need to walk that nested structure dynamically.
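For reference, here’s a rough sketch of what that walk could look like, assuming the profile dictionary nests its steps under a `children` key; treat the exact key names as assumptions rather than documented facts:

```python
def walk_profile(step, depth=0):
    """Recursively print per-step metrics from a PROFILE plan dictionary."""
    args = step.get("args", {})
    print("  " * depth
          + f"{step.get('operatorType')}: "
          + f"dbHits={step.get('dbHits', args.get('DbHits'))}, "
          + f"rows={step.get('rows', args.get('Rows'))}")
    for child in step.get("children", []):
        walk_profile(child, depth + 1)

# Example: walk_profile(summary.profile), where summary = result.consume()
# on a query prefixed with PROFILE
```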
At this point, I’m not sure I even need that information. If I were working on optimizing a query (which I may very well need to do at some point), I’d want the child level data so I could optimize the steps and hone the results. However, that feels like a separate task from simply measuring how the queries perform across different graph models.
Decision
For now, I’ll keep in mind that the information I want is accessed through `.consume()`. I’ll collect that data for all the queries after I’ve determined how I want to capture them (i.e., once I’ve gone through the full list of analysis metrics), then narrow down the fields available for all query Result objects or determine whether I need to tailor the fields for each specific query.
That will just leave capturing the metadata for multiple runs of the query; calculating the minimum, maximum, and standard deviation of those runs; and figuring out the best way to start and stop the different graph models (or store them as different databases in the same instance, which is now possible thanks to version 4).
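As a preview of that piece, here’s a rough sketch of the multiple-run capture; the helper name, the run count, and the choice of which timing to summarize are all placeholders I’ll revisit:

```python
import statistics

def time_query(session, query, runs=5):
    """Run a query several times and summarize the server-side timing (ms)."""
    timings = []
    for _ in range(runs):
        summary = session.run(query).consume()
        timings.append(summary.result_available_after)
    return {
        "min": min(timings),
        "max": max(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }

# Example usage:
# with driver.session() as session:
#     print(time_query(session, "MATCH (c) RETURN count(c) AS node_count"))
```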