Amazon Neptune - part 2

a gentle introduction

This blog post is the second part of my introduction to Neptune. You can read the first part here.

In the previous post, I left out three points:

  • Export: export data to S3.
  • Streams: capture changes to a graph as they occur.
  • Full-text search: based on Amazon OpenSearch Service and Lucene query syntax.

Export: export data to S3

According to the AWS guidelines, there are several ways to export data from a Neptune DB cluster:

  • For small amounts of data, use the results of a query or queries
  • There is also a powerful and flexible open-source tool for exporting Neptune data, namely neptune-export

Because I am new to Neptune, I trust AWS on best practices, follow the instructions, and deploy the Neptune-Export AWS CloudFormation template.

After the Neptune-Export installation is complete, I can see in my account:

  • Neptune Export API (APIGW)
  • AWS Batch

I start the export with this command:

curl \
  (your NeptuneExportApiUri) \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "command": "export-pg",
        "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
        "params": { "endpoint": "(your Neptune endpoint DNS name)" }
      }'

I need to allow connectivity from AWS Batch to my Neptune cluster.

This means I need to add the NeptuneExportSecurityGroup created by the AWS CloudFormation stack as an inbound rule on the security group of my Neptune cluster.
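
For reference, this is roughly what that inbound rule looks like with the AWS CLI (a sketch only; the identifiers in parentheses are placeholders, and 8182 is the default Neptune port):

# allow inbound traffic on the Neptune port from the export security group
aws ec2 authorize-security-group-ingress \
  --group-id (your Neptune cluster security group id) \
  --protocol tcp \
  --port 8182 \
  --source-group (the NeptuneExportSecurityGroup id)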

Without this, you will get this error:

java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists

I have imported around 100 million nodes because, as usual, I do not test AWS services with a hello-world scenario, and, to my complete non-surprise, I get another error:

An error occurred while counting all nodes. Elapsed time: 121 seconds
java.util.concurrent.CompletionException: org.apache.tinkerpop.gremlin.driver.exception.ResponseException: {"code":"TimeLimitExceededException","detailedMessage":"A timeout occurred within the script during evaluation.",

You need to go into the Neptune console, open Parameter groups, change the neptune_query_timeout parameter from 2 minutes to something much higher, run the export again, and hope for the best. But maybe you will find another error:

An error occurred while writing all nodes as CSV files. Elapsed time: 1200 seconds

And so on until the export works.
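
If you prefer the AWS CLI to the console, the timeout change looks roughly like this (a sketch, assuming the parameter lives in a custom cluster parameter group; the value is in milliseconds):

# the default of 120000 ms (2 minutes) is raised here to 30 minutes
aws neptune modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name (your Neptune parameter group) \
  --parameters "ParameterName=neptune_query_timeout,ParameterValue=1800000,ApplyMethod=immediate"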

Streams: capture changes to a graph as they occur

Why export all the data in one go when you can do the same with Streams? That is what I thought, and so Streams was my next stop after the export experience.

I was expecting out-of-the-box integration to services like:

  • AWS Lambda
  • Amazon Kinesis

It turned out that Neptune Streams is nothing like that; in reality, you need to poll the data out yourself. For example, suppose you want to integrate with AWS Lambda. In that case, Neptune provides a polling framework through a CloudFormation template. This framework allows you to author a stream handler and then register it with a Lambda function provided by the polling framework. Furthermore, the framework uses an AWS Step Functions and DynamoDB-based workflow to schedule the host Lambda's execution and checkpoint the stream processing. For more, refer here.

I did not try it out because it would be another Hello World, and I bet it does not cover everything I need; for sure, I would need some customization, so I would have to implement my own version of the polling framework.
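
At its core, polling the stream yourself just means calling the Neptune Streams HTTP endpoint. A rough sketch (assuming a recent engine version, where the property-graph change log is exposed at /propertygraph/stream, and that the neptune_streams cluster parameter is enabled):

# iteratorType can be TRIM_HORIZON, LATEST, AT_SEQUENCE_NUMBER or AFTER_SEQUENCE_NUMBER
curl -s \
  "https://(your Neptune endpoint DNS name):8182/propertygraph/stream?iteratorType=TRIM_HORIZON&limit=10"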

Neptune integrates with Amazon OpenSearch Service, but this too is not out of the box: you need to go through a lot of configuration and enable Streams to make it work.

Keep in mind that there are two clusters to maintain (Neptune and OpenSearch), increasing the costs.

Each document in OpenSearch corresponds to a node in the graph, including information about the edges, labels and properties.
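
A rough sketch of what such a document might look like (the field names are inferred from the query syntax below and may differ from the exact schema; the values are made up):

{
  "entity_id": "book-1",
  "entity_type": ["Book"],
  "document_type": "vertex",
  "predicates": {
    "name": [{ "value": "Pride and Prejudice" }]
  }
}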

Once it is all up and running, I need to change my full-text queries to use Apache Lucene query syntax, for example:

g.withSideEffect("Neptune#fts.endpoint", "es_enpoint")
  .withSideEffect("Neptune#fts.queryType", "query_string")
  .V()
  .has("*", "Neptune#fts predicates.name.value:\"Jane Austin\" AND entity_type:Book")

It is essential to read the section Full-Text-Search Query Execution in Amazon Neptune to avoid making too many OpenSearch calls.
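
My reading of that section is that where the full-text predicate sits in the traversal determines how many OpenSearch round trips are made. A hedged sketch of the two shapes (the edge labels are made up for illustration):

// a single OpenSearch call: the full-text predicate drives the traversal from the start
g.withSideEffect("Neptune#fts.endpoint", "es_endpoint")
  .V()
  .has("name", "Neptune#fts austen~")
  .out("wrote")

// potentially many OpenSearch calls: the predicate is evaluated against
// the solutions produced by the earlier steps
g.withSideEffect("Neptune#fts.endpoint", "es_endpoint")
  .V().hasLabel("Book")
  .out("writtenBy")
  .has("name", "Neptune#fts austen~")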

Conclusion

I am disappointed with the service, not as a graph database, but because of the ecosystem AWS provides around it.

You need a Jupyter notebook to run queries, while it would be great to have some integration with the console where I could run a query and see the results. I should not have to set up extra things.

The export service does not work out of the box, so I have to keep amending the Neptune configuration until it works; constant trial and error is not a great way to do it. The script takes the count of all the nodes, splits the queries into ranges, and moves the results to S3. See the examples here.

__.V().count()  -> take the total node count, then split it into ranges:
__.V().range(0L,5012167L).project("~id","~label","properties").by(T.id).by(__.label().fold()).by(__.valueMap())
__.V().range(5012167L,10024334L).project("~id","~label","properties").by(T.id).by(__.label().fold()).by(__.valueMap())
__.V().range(10024334L,15036501L).project("~id","~label","properties").by(T.id).by(__.label().fold()).by(__.valueMap())
__.V().range(15036501L,-1L).project("~id","~label","properties").by(T.id).by(__.label().fold()).by(__.valueMap())

A better way would be to start with polling and export incrementally that way, or to keep a running overall count while your application inserts nodes and then run these queries over smaller ranges, using a combination of AWS Step Functions and AWS Lambda. It would also be cheaper.

Instead of asking customers to build their own polling framework, Neptune Streams should offer a few integration options.

Full-text search is just an integration with Amazon OpenSearch Service, the successor to Amazon Elasticsearch Service, and here my only criticisms are:

  • I don't need an extra cluster to maintain
  • It is time for a serverless search service

Another strange feature that I found is that I can stop the cluster manually from the console and save some money, but I also found this:

You can stop a database for up to seven (7) days. It will be automatically started if you do not manually start your database after seven (7) days.

I am still trying to figure this one out; perhaps AWS just wants me to keep using it.

All database services should offer something like Aurora. In addition, it would be nice to have an option to stop the cluster when nobody is using it and to scale back up automatically at the first request.

I miss Serverless and all the benefits I get straight away with it. Of course, not everything can be serverless, but AWS should offer better products with some standard features, like service integrations.