Digital Magic: August 2017

The setting

One of our customer's products is a Play2 application written in Scala on top of Lightbend Reactive Platform. The application takes advantage of a graph database – TitanDB 1.0.0 in embedded mode, backed by Cassandra.

TitanDB allows fast and convenient queries across entities bound by relations (e.g. social networks).

The application uses Titan's Java API to manipulate and query the graph.

The problem

At some point, we started noticing that certain queries were running really sluggishly – this was accompanied by Titan log messages such as this one:

Query requires iterating over all vertices [(name = user001 AND ~label = agent)]. For better performance, use indexes

This was strange, as our application did check that the necessary indexes existed – and did create them if they weren't – at its start-up. The code looked something like this:

It was a surprise: a query that was expected to use an index actually did not. For the purposes of testing, I have manually ran a hand-crafted query that was definitely supposed to use that index and ... got the same result. Interesting.

Scanning through index documentation did not shed more light (well, there actually were hints at my problem on that page, but I've missed them then). So what do software developers do when RTFM doesn't help? Right, Google it. I found a number of articles on StackOverflow, and I realized that TitanDB indexes had a lifecycle. I've used the 'RTFS' magical trick (read those sources :)) to get help:

INSTALLED - The index is installed in the system but not yet registered with all instances in the cluster
REGISTERED - The index is registered with all instances in the cluster but not (yet) enabled
ENABLED - The index is enabled and in use
DISABLED - The index is disabled and no longer in use

So it's possible our indexes weren't enabled. The next step was to understand the statuses of our indexes – this can be done via the Gremlin console (there is also a programmatic way to do that from Scala):

mgmt = g.getGraph().openManagement();
names = [ "CompositeNameSalesperson", "CompositeNameAgent" ]; // <-- all indexes are here, actually
res = names.collect { [ it , mgmt.getGraphIndex(it).getIndexStatus(mgmt.getPropertyKey('name')) ] };
mgmt.commit();
res

so we got something like this:

[
"[CompositeNameSalesperson, ENABLED]",
"[CompositeNameAgent, INSTALLED]"
]

Well-well, one of indexes is not enabled (and effectively is not used). And this is a new type of vertex, added few days ago.

The fix

Let's postpone finding the reason for the index being stuck in this INSTALLED lifecycle for a while. First order of business, I needed to repair the index by promoting it into the ENABLED state.

Since my index is INSTALLED, the next state should be REGISTERED. Let's go:

mgmt = g.getGraph().openManagement();
mgmt.updateIndex(mgmt.getGraphIndex("CompositeNameAgent"), com.thinkaurelius.titan.core.schema.SchemaAction.REGISTER_INDEX);
mgmt.commit();

Now I checked the state of the index again:

mgmt = g.getGraph().openManagement();
names = [ "CompositeNameAgent" ];
res = names.collect { [ it , mgmt.getGraphIndex(it).getIndexStatus(mgmt.getPropertyKey('name')) ] };
mgmt.commit();
res

and got:

[
"[CompositeNameAgent, INSTALLED]"
]

What ?! Why? Turned verbose logging on and I got:

Some key(s) on index CompositeNameAgent do not currently have status REGISTERED: name=INSTALLED

The next iteration of Seq(documentation, Google, StackOverlow).foreach(_.read) has yielded the following list of possible reasons due to which changing the state of the index might fail:
- open transactions
- stalled Titan instances

I was pretty sure nobody used Titan at that moment, so I went the second path to investigate. Hopefully, the Titan API can respond with enough information:

mgmt = g.getGraph().openManagement();
y = mgmt.getOpenInstances();
mgmt.commit();
y;

and the response is:

[ac1100041-hostname(current), ac1100041-hostname, ac1100021-hostname]

Hmm, so why are there 3 instances if I knew that only one application instance was running? Let's consult the documentation: "... TitanFactory can also be used to open an embedded Titan graph instance from within a JVM-based user application ..." – this would explain the existence of 2 instances (we're a using the factory twice) but what about the 3rd one?

Continuing reading the documentation about failures I got at my scenario:

However, some schema related operations - such as installing indexes - require the coordination of all Titan instances. For this reason, Titan maintains a record of all running instances. If an instance fails, i.e. is not properly shut down, Titan considers it to be active and expects its participation in cluster-wide operations which subsequently fail because this instances did not participate in or did not acknowledge the operation.

In this case, the user must manually remove the failed instance record from the cluster and then retry the operation.

Bingo! That's explains the reason for our inability to change the status of the index. The "ac1100021-hostname" instance seems to be the black sheep here. Let's get rid of it:

toRemove = ["ac1100021-hostname"];
mgmt = g.getGraph().openManagement();
x = mgmt.getOpenInstances();
toRemove.collect{ mgmt.forceCloseInstance(it) };
mgmt.commit();
y = mgmt.getOpenInstances();
[x, y];

that operation returned:

[
"[ac1100041-hostname(current), ac1100041-hostname, ac1100021-hostname]",
"[ac1100041-hostname(current), ac1100041-hostname]"
]

Now I have repeated the operations stated at the begging of this chapter and I got:

[
"[CompositeNameAgent, REGISTERED]"
]

I did that! Finally let's enable index...

mgmt = g.getGraph().openManagement();
mgmt.updateIndex(mgmt.getGraphIndex("CompositeNameAgent"), com.thinkaurelius.titan.core.schema.SchemaAction.ENABLE_INDEX);
mgmt.commit();

...and verify the result:

mgmt = g.getGraph().openManagement();
names = [ "CompositeNameAgent" ];
res = names.collect { [ it , mgmt.getGraphIndex(it).getIndexStatus(mgmt.getPropertyKey('name')) ] };
mgmt.commit();
res

we finally get:

[CompositeNameAgent, ENABLED]

I've checked the original query (which was suppose to use the index) - and it ran as fast as other queries. Problem solved! But wait a minute! Game is not over.

Root causes

So why did the index get stuck in the 'INSTALLED' state in the first place? (also I recall seeing some of the indexes being stuck in the 'REGISTERED' state)

I end up with 2 possible reasons:

a. In case an application instance (with an embedded Titan instance) crashes, it remains intact in Titan bookkeeping as a cluster member – this dead instance prevents index state from progressing, as it can not be communicated with. The following log records can be observed in this case:

Some key(s) on index CompositeNameAgent do not currently have status ENABLED: name=REGISTERED
...
Timed out (PT1M) while waiting for index CompositeNameAgent to converge on status ENABLED

An advice: make sure there are no dangling Titan cluster members - clear them our prior to creating indexes (or doing any other cluster-wide operation).

b. An issue with the code which creates indexes:

val index = management.buildIndex(indexName, classOf[Vertex]).addKey(key).indexOnly(label).unique().buildCompositeIndex()
management.setConsistency(index, ConsistencyModifier.LOCK)

Reading through Titan documentation and StackOverflow articles, I have found that this call returns as soon as the index has entered the INSTALLED state. To be sure that the index has actually entered the ENABLED state (and hence is actually ready to be used by queries), it is required to wait for index creation completion outside the transaction where it was created, and make sure no other transactions bother us. So the index setup code (see the beginning of the article) was changed to:

05 August 2017

TitanDB: why doesn't my index work ?

The setting

The problem

The fix

Root causes