Code and Data for "Skolemising Blank Nodes while Preserving Isomorphism"

A static snapshot of the code used for the paper is available here for reproducibility purposes. The code uses two packages under the Apache License 2.0: Apache Commons CLI and Guava. It also uses the NxParser under a New BSD licence. The most recent version of the code is released as the BLabel library on GitHub under the Apache License 2.0.

The synthetic graphs used are available here in a ZIP file (these graphs were selected from here). If you wish to run these experiments locally:

  1. Download the code and data from the locations above.
  2. Run the class cl.uchile.dcc.skolem.cli.RunSyntheticEvaluation.
  3. The code will parse the test cases from the files one by one and order them by class (e.g., cliques) and size. It will run each class in ascending order of size until either all instances of the class are finished or a test fails; it then moves on to the next class.
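The class-then-size ordering described in step 3 can be illustrated with a small sketch. The file-naming convention below (<class>-<size>.nt) is hypothetical; the actual names inside the ZIP file may differ.

```shell
# Hypothetical sketch: assuming test cases are named <class>-<size>.nt,
# group them by class and then numerically by size.
mkdir -p graphs
touch graphs/clique-2.nt graphs/clique-10.nt graphs/grid-5.nt

# Sort on the first '-'-delimited field (class), then the second (size, numeric):
ls graphs | sort -t- -k1,1 -k2,2n
# → clique-2.nt  clique-10.nt  grid-5.nt
```

Note the numeric flag on the second key: a plain lexicographic sort would place clique-10.nt before clique-2.nt, breaking the ascending-size order the evaluation relies on.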

The real-world experiments were run over the BTC-2014 dataset, available here. Since this requires processing 4 billion triples (over 1 TB of uncompressed data), you will need a moderately sized machine to replicate these experiments:

  1. Follow the instructions to download the BTC-2014 dataset.
  2. Concatenate and re-GZip all data* files (e.g., zcat */data*gz | gzip -c > data-all.nq.gz).
  3. Sort the data by context, e.g., using the NxParser (this is necessary to group the triples of each document together): java -Xmx$$G -jar nxparser-1.2.4.jar -i data-all.nq.gz -igz -o data-all.3012.nq.gz -ogz -so 3012 2> sort.log, where you should replace $$ with a large amount of RAM (in GB) to avoid creating too many intermediate batch files.
  4. You can run the control experiments by calling cl.uchile.dcc.skolem.cli.Control -i data-all.3012.nq.gz -igz 2> control.log > control.std (remember to set a reasonable amount of RAM; we used 30 GB, but that much is not necessary).
  5. For the labelling experiments, run cl.uchile.dcc.skolem.cli.ComputeCanonicalGraphs -i data-all.3012.nq.gz -igz -s $$ 2> canon.log > canon.std, replacing $$ with the ID of the hashing scheme (run with -h to see the options; again, remember to set a reasonable amount of RAM; we used 30 GB, but that much is not necessary).
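The concatenation in step 2 can be exercised on mock data before committing to the full 1 TB download. The directory names and quad contents below are invented for illustration; the real BTC-2014 dump ships many data*.gz chunks spread across several directories.

```shell
# Runnable sketch of step 2 on mock data (hypothetical names/contents):
mkdir -p chunk1 chunk2
printf '<a> <b> <c> <g1> .\n' | gzip -c > chunk1/data-0001.nq.gz
printf '<d> <e> <f> <g2> .\n' | gzip -c > chunk2/data-0001.nq.gz

# Concatenated gzip streams form a valid gzip file, so no decompression
# round-trip is strictly required; re-gzipping as in step 2 also works:
zcat */data*gz | gzip -c > data-all.nq.gz
zcat data-all.nq.gz | wc -l   # → 2
```

The subsequent sort by context (step 3) then makes all quads sharing a graph/context label contiguous in the file, which is what lets the later tools process one document's triples at a time.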


