skolem

Code and Data for "Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes"

A static snapshot of the code used for the paper is available here for reproducibility purposes. The code uses two packages under "Apache License 2.0": Apache Commons CLI and Guava. The code also uses the NxParser under a New BSD licence. The most recent version of the code is released as the BLabel library on GitHub under Apache License 2.0.

The synthetic graphs used are available here in a ZIP file (these graphs were selected from here). If you wish to run these experiments locally:

  1. Download the code and data from the locations above.
  2. Run the class cl.uchile.dcc.blabel.cli.RunSyntheticEvaluation.
  3. The code will parse the test-cases from the files one by one and order them by class (e.g., cliques) and size. It will run each class in ascending order of size until either all instances of the class are finished or a test fails; then it moves on to the next class.
  4. Each run evaluates one configuration. Below is a list of all the configurations run for the experiments in the paper. You can run the list as a batch file in Windows, or remove the comment lines (those starting with ::) to run it as a shell script in Linux. In either case, build the jar file first and make sure it is in the dist folder relative to the script/batch, with the synthetic graphs in the eval folder.
:: Test framework (correctness checks)
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 4 -d eval/ -t 600 > test-t600.tsv 2> test-t600.err
:: Label
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 1 -d eval/ -t 600 -s 1 > label-t600-s1.tsv 2> label-t600-s1.err
:: Label-NoPrune
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 1 -d eval/ -t 600 -s 1 -nlabel > label-t600-s1-n.tsv 2> label-t600-s1-n.err
:: DFS+Label
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 2 -d eval/ -t 600 -s 1 -l 0 > both-t600-s1-dfs.tsv 2> both-t600-s1-dfs.err
:: DFS-Rand+Label
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 2 -d eval/ -t 600 -s 1 -l 0 -r > both-t600-s1-dfs-r.tsv 2> both-t600-s1-dfs-r.err
:: DFS-NoPrune+Label
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 2 -d eval/ -t 600 -s 1 -l 0 -r -nlean > both-t600-s1-dfs-n.tsv 2> both-t600-s1-dfs-n.err
:: BFS+Label
java -jar -Xmx1G -Xss10M dist/blabel.jar RunSyntheticEvaluation -b 2 -d eval/ -t 600 -s 1 -l 1 > both-t600-s1-bfs.tsv 2> both-t600-s1-bfs.err
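
As an alternative to deleting the comment lines by hand, the batch-style :: comments can be rewritten as shell # comments so the labels survive in the Linux version. The following is a minimal sketch, not part of the released code; the function name batch_to_shell is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of blabel): read a batch listing on stdin and
# rewrite lines that start with "::" as shell comments starting with "# ".
# Lines that do not start with "::" (the java commands) pass through unchanged.
batch_to_shell() {
  sed 's/^::[[:space:]]*/# /'
}
```

Assuming the listing above was saved as run-synthetic.bat (a hypothetical file name), running batch_to_shell < run-synthetic.bat > run-synthetic.sh would produce a script that can be run with sh.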
	 

The real-world experiments were run over the BTC-2014 dataset, available here. Since this requires processing 4 billion triples (over 1 TB of uncompressed data), you will need a moderately sized machine to replicate these experiments:

  1. Follow the instructions to download the BTC-2014 dataset.
  2. Concatenate and re-GZip all data* files (e.g., zcat */data*gz | gzip -c > data-all.nq.gz)
  3. Sort the data by context, e.g., using the NxParser; this is necessary to group the triples of each document together: java -jar -Xmx$$G nxparser-1.2.4.jar Sort -i data-all.nq.gz -igz -o data-all.3012.nq.gz -ogz -so 3012 2> sort.log (make sure to replace $$ with a large amount of RAM to avoid creating too many intermediate batch files).
  4. The leaning/labelling experiments are run with the class cl.uchile.dcc.blabel.cli.RunNQuadsTest; you can call the class with -h to get an explanation of all arguments.
  5. Below we give a shell script to run in Linux: this shows all the arguments needed to run the experiments of the paper. (Running this script all together over BTC14 would take a couple of weeks.)
# Test framework
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 4 -t 600 -e btc14/test/error/ > test.tsv 2> test.err
# Control
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 3 -t 600 > control.tsv 2> control.err
# Labelling MD5
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 1 -i data-all.3012.nq.gz -igz -s 0 -t 600 -e btc14/label-s0/ > label-s0.tsv 2> label-s0.err
# Labelling Murmur
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 1 -i data-all.3012.nq.gz -igz -s 1 -t 600 -e btc14/label-s1/ > label-s1.tsv 2> label-s1.err
# Labelling Sha1
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 1 -i data-all.3012.nq.gz -igz -s 2 -t 600 -e btc14/label-s2/ > label-s2.tsv 2> label-s2.err
# Labelling Murmur wo/ pruning
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 1 -i data-all.3012.nq.gz -igz -s 1 -t 600 -e btc14/label-s1-np/ -nlabel > label-s1-np.tsv 2> label-s1-np.err
# DFS Standard
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 0 -i data-all.3012.nq.gz -igz -t 600 -e btc14/lean-s1-dfs/ -l 0 > lean-s1-dfs.tsv 2> lean-s1-dfs.err
# DFS Random order
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 0 -i data-all.3012.nq.gz -igz -t 600 -e btc14/lean-s1-dfs-r/ -l 0 -r > lean-s1-dfs-r.tsv 2> lean-s1-dfs-r.err
# DFS wo/ pruning
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 0 -i data-all.3012.nq.gz -igz -t 600 -e btc14/lean-s1-dfs-n/ -l 0 -nlean > lean-s1-dfs-n.tsv 2> lean-s1-dfs-n.err
# BFS
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 0 -i data-all.3012.nq.gz -igz -t 600 -e btc14/lean-s1-bfs/ -l 1 > lean-s1-bfs.tsv 2> lean-s1-bfs.err
# DFS Standard + Label
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 2 -i data-all.3012.nq.gz -igz -s 1 -t 600 -e btc14/both-s1-dfs/ -l 0 > both-s1-dfs.tsv 2> both-s1-dfs.err
# DFS Standard + Label (count duplicate equiv graphs)
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 2 -i data-all.3012.nq.gz -igz -s 1 -t 600 -e btc14/both-s1-dfs-d/ -d -l 0 > both-s1-dfs-d.tsv 2> both-s1-dfs-d.err
# Label (count duplicate iso graphs)
java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 1 -i data-all.3012.nq.gz -igz -s 1 -t 600 -e btc14/both-s1-dfs-d/ -d > label-s1-d.tsv 2> label-s1-d.err
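
Every command in the script above sends standard output to a .tsv file and standard error to a .err file with the same base name. If you adapt the script, a small wrapper can keep that naming consistent. This is only a sketch under that assumption; run_config is a hypothetical helper, not something provided by the code release.

```shell
#!/bin/sh
# Hypothetical helper (not part of blabel): run one configuration and capture
# its output the same way the commands above do.
#   $1    base name for the output files
#   $2... the command to run, with all of its arguments
run_config() {
  name="$1"
  shift
  # stdout goes to <name>.tsv, stderr goes to <name>.err
  "$@" > "${name}.tsv" 2> "${name}.err"
}
```

For example, the control run above could be written as run_config control java -jar -Xmx30G -Xss100M dist/blabel.jar RunNQuadsTest -i btc14/data-all.3012.nq.gz -igz -b 3 -t 600.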
	

Contact

Email
aidhog@gmail.com

