Wednesday, October 3, 2012

Provenance and Traceability in RDF with Callimachus

All too often software is designed without adequate regard for traceability. Traceability refers to the ability to audit the state of data at any point in the system for correctness and completeness: for any entity in the system, all transactions that lead to its current state, along with their metadata, can be examined, reviewed, and verified. Software is supposed to be designed according to the stakeholders' requirements, but many of these experts take traceability for granted. Most people don't audit most of the time, but the ability to audit at all requires traceability all the time.
Consider the common scenario where a business is trying to provide some semi-automation to a business process. Often businesses are trying to move from an informal email-based process to a web-based semi-automated process. Such a move can reduce human involvement and make the process faster and more efficient, leading to greater productivity. However, few participants realize the inherent traceability of email-based processes. Moving away from email-based to web-based, without proper consideration, can kill a company's ability to audit the process for correctness and completeness.
Today most web-based systems are built using SQL databases. However, the rigid nature of SQL-based systems creates a significant barrier for adding traceability to an existing SQL-based system. Traceability is not an add-on feature; it requires deep integration into every change and every transaction. This is something many SQL-based systems cannot easily provide.
Papers on digital traceability date as far back as 1986. However, a quarter of a century later, there are still no standards for tracking digital conceptual objects (as there are in many other industries for the traceability of physical objects). Furthermore, following the explosion of digital data in the past decade and the increased reliance on information from the Web, there is a growing concern that no one seems to know whether the information being collected is accurate.
This may change in 2013, as the W3C has been working since 2009 on a general provenance information standard that is scheduled for release next year. Its goal is to support the widespread publication and use of provenance information for Web documents, data, and resources. Specifically, the working group is defining a provenance interchange language and methods to publish and access provenance metadata using this language.
The PROV specification (currently in last call) defines things as entities, activities, and agents. Entities are physical, digital, conceptual, or any other kind of thing. Examples of such entities are a web page, a chart, and a spellchecker. Activities are how entities come into existence and how their attributes change. An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or any other entity that may be ascribed responsibility.
Callimachus 0.18 will be the first Callimachus release to use this new PROV language to seamlessly describe all the activities that take place in the system. The Callimachus project was named after the man who created the first library catalogue, so it should not be too surprising that the project continues this legacy by creating metadata about every activity performed.
When a new resource is created, metadata is stored in the triple store to record the event. Each activity is stored in the RDF store in its own named graph. For example, suppose a create form submits the following triples:
</sun> a </callimachus/Concept> , skos:Concept ;
    skos:prefLabel "Sun" ;
    skos:definition "The great luminary" .
Additional authorization information is copied from the class and the parent folder, including:
</sun> calli:reader </group/public> ;
    calli:subscriber </group/everyone> ;
    calli:editor </group/staff> ;
    calli:administrator </group/admin> .
Callimachus uses this authorization information as a simple authorization model that is similar to the ACL of a file system. Here the groups or users of the system are assigned authorization rights to the resource: calli:reader provides read-only access; calli:subscriber provides access to the resource's history and provenance data and grants the ability to discuss or comment on the resource; calli:editor provides the ability to change the resource; and calli:administrator provides the ability to change the authorization information.
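As a rough illustration of this model, the sketch below checks whether a user may perform an action on a resource. It is not Callimachus code: the action names ("read", "history", "discuss", "edit", "authorize"), the isAllowed function, and the assumption that each property grants only the rights listed for it (rather than also implying the weaker ones) are all hypothetical.

```javascript
// Hypothetical mapping from the authorization properties to the rights
// described above. The action names are illustrative only.
var RIGHTS = {
    "calli:reader": ["read"],
    "calli:subscriber": ["history", "discuss"],
    "calli:editor": ["edit"],
    "calli:administrator": ["authorize"]
};

// acl maps each property to the groups listed on the resource.
// Returns true if any of the user's groups is granted the action.
function isAllowed(acl, userGroups, action) {
    for (var property in RIGHTS) {
        if (!RIGHTS.hasOwnProperty(property)) continue;
        if (RIGHTS[property].indexOf(action) < 0) continue;
        var groups = acl[property] || [];
        for (var i = 0; i < groups.length; i++) {
            if (userGroups.indexOf(groups[i]) >= 0) return true;
        }
    }
    return false;
}

// The authorization triples of </sun> from the example above.
var sunAcl = {
    "calli:reader": ["/group/public"],
    "calli:subscriber": ["/group/everyone"],
    "calli:editor": ["/group/staff"],
    "calli:administrator": ["/group/admin"]
};
```

With this data, a member of /group/staff may edit </sun>, while a user who belongs only to /group/public may read it but not change it.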
The resource is also inserted into the parent folder using the following triple:
</> calli:hasComponent </sun> .
Callimachus provides a hierarchical view of resources that mimics the path segments of their identifier. This hierarchical relationship is captured using the inverse-functional calli:hasComponent property from the parent resource to its child. The reason this is an inverse-functional relationship is to require proper authorization to change the parent resource when adding a new child resource.
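To make the path-segment hierarchy concrete, here is a minimal sketch that derives the parent folder of a resource from its path and builds the matching calli:hasComponent triple. The function names parentFolder and hasComponentTriple are hypothetical and only illustrate the naming convention visible in the examples; they are not part of Callimachus.

```javascript
// Returns the identifier of the parent folder for a resource path,
// or null for the root folder itself.
function parentFolder(path) {
    if (path === "/") return null;
    // Ignore a trailing slash so that a folder resolves to its own parent.
    var trimmed = path.charAt(path.length - 1) === "/" ? path.slice(0, -1) : path;
    return trimmed.substring(0, trimmed.lastIndexOf("/") + 1);
}

// Builds the triple that files a resource under its parent folder.
function hasComponentTriple(path) {
    var parent = parentFolder(path);
    if (parent === null) return null;
    return "<" + parent + "> calli:hasComponent <" + path + "> .";
}
```

For the resource above, hasComponentTriple("/sun") yields the same triple shown earlier, with </> as the parent.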
Finally, all these triples are combined and stored in the RDF store in an activity graph, along with the PROV metadata of the activity itself. The prov:wasGeneratedBy property is a functional property that links each resource entity to the last activity that modified it. The property path prov:generated/prov:specializationOf links the activity to the resource entities it modified.
GRAPH </activity/2012/11/08/t1> {

    </activity/2012/11/08/t1> a </callimachus/Activity>, audit:RecentBundle ;
        calli:reader </group/everyone>, </group/staff>, </group/admin> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
        prov:wasInfluencedBy </activity/2012/11/08/t0> .

    </activity/2012/11/08/t1#provenance> a prov:Activity ;
        prov:startedAtTime "2012-11-08T15:07:22.869Z"^^xsd:dateTime ;
        prov:wasAssociatedWith </user/james> ;
        prov:generated </activity/2012/11/08/t1#!/sun> ;
        prov:generated </activity/2012/11/08/t1#!/> ;
        prov:generated </activity/2012/11/08/t1#!/activity/2012/11/08/> ;
        prov:endedAtTime "2012-11-08T15:07:24.583Z"^^xsd:dateTime .

    </activity/2012/11/08/t1#!/sun>
        prov:specializationOf </sun> .
    </sun> a </callimachus/Concept>, skos:Concept ;
        calli:administrator </group/admin> ;
        calli:editor </group/staff> ;
        calli:reader </group/public> ;
        calli:subscriber </group/everyone> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
        skos:definition "The great luminary" ;
        skos:prefLabel "Sun" .

    </activity/2012/11/08/t1#!/>
        prov:specializationOf </> ;
        prov:wasRevisionOf </activity/2012/11/08/t0#!/> .
    </>
        calli:hasComponent </sun> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .

    </activity/2012/11/08/t1#!/activity/2012/11/08/>
        prov:specializationOf </activity/2012/11/08/> ;
        prov:wasRevisionOf </activity/2012/11/08/t0#!/activity/2012/11/08/> .
    </activity/2012/11/08/>
        calli:hasComponent </activity/2012/11/08/t1> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
}
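The identifiers in the graph above follow a visible pattern: the provenance node is the activity identifier plus "#provenance", and each versioned entity is the activity identifier plus "#!" plus the resource path. The sketch below only restates that pattern as inferred from the listing; the function names are hypothetical and the pattern is not taken from Callimachus documentation.

```javascript
// Identifier of the prov:Activity node that describes an activity graph.
function provenanceIri(activity) {
    return activity + "#provenance";
}

// Identifier of the version of a resource produced by an activity.
function versionIri(activity, resource) {
    return activity + "#!" + resource;
}

// Folder that holds the activity, which also receives a
// calli:hasComponent triple in the listing above.
function activityFolder(activity) {
    return activity.substring(0, activity.lastIndexOf("/") + 1);
}
```

For example, versionIri("/activity/2012/11/08/t1", "/sun") gives the entity that is linked to </sun> by prov:specializationOf.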
Modifying a resource is a bit trickier, as Callimachus stores both the previous version and the new version of the resource. If a client sends the following update to the server:
DELETE DATA {
    </sun> skos:definition "The great luminary" .
};
INSERT DATA {
    </sun> skos:definition "The lamp of day" .
};
Three triples are removed (not just one) from all graphs in the RDF store.
DELETE DATA {
    </sun> skos:definition "The great luminary" ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .

    </activity/2012/11/08/> prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
};
The triple is then replaced with the following to keep the semantics of the first activity intact.
INSERT DATA {
    GRAPH </activity/2012/11/08/t1> {
        </activity/2012/11/08/t1#!/sun> audit:with </activity/2012/11/08/t2#5eef4c8f> .
    } 
} 
In addition, a new named graph is created with the following, to represent this new activity.
GRAPH </activity/2012/11/08/t2> {

    </activity/2012/11/08/t2> a </callimachus/Activity> , audit:RecentBundle ;
        calli:reader </group/everyone>, </group/staff>, </group/admin> ;
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
        prov:wasInfluencedBy </activity/2012/11/08/t1> .

    </activity/2012/11/08/t2#provenance> a prov:Activity ;
        prov:startedAtTime "2012-11-08T15:19:31.199Z"^^xsd:dateTime ;
        prov:wasAssociatedWith </user/james> ;
        prov:generated </activity/2012/11/08/t2#!/sun> ;
        prov:generated </activity/2012/11/08/t2#!/activity/2012/11/08/> ;
        prov:endedAtTime "2012-11-08T15:19:31.295Z"^^xsd:dateTime .

    </activity/2012/11/08/t2#!/sun>
        audit:without </activity/2012/11/08/t2#5eef4c8f> ;
        prov:specializationOf </sun> ;
        prov:wasRevisionOf </activity/2012/11/08/t1#!/sun> .
    </activity/2012/11/08/t2#5eef4c8f>
        rdf:object "The great luminary" ;
        rdf:predicate skos:definition ;
        rdf:subject </sun> .
    </sun>
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
        skos:definition "The lamp of day" .

    </activity/2012/11/08/t2#!/activity/2012/11/08/>
        prov:specializationOf </activity/2012/11/08/> ;
        prov:wasRevisionOf </activity/2012/11/08/t1#!/activity/2012/11/08/> .
    </activity/2012/11/08/>
        calli:hasComponent </activity/2012/11/08/t2> ;
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> .
}
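One way to read the audit:without link and the reified triple it points to is as the information needed to step back from the new version to the previous one: drop what the activity added and restore what it removed. The sketch below illustrates that reading over plain in-memory objects. The {s, p, o} representation and the revert function are assumptions made for illustration, not the way Callimachus restores versions.

```javascript
// Triples are modelled as plain {s, p, o} objects for illustration.
function sameTriple(a, b) {
    return a.s === b.s && a.p === b.p && a.o === b.o;
}

// Undo one activity: drop the triples it added, restore the ones it removed.
function revert(current, added, removed) {
    var kept = [];
    for (var i = 0; i < current.length; i++) {
        var wasAdded = false;
        for (var j = 0; j < added.length; j++) {
            if (sameTriple(current[i], added[j])) wasAdded = true;
        }
        if (!wasAdded) kept.push(current[i]);
    }
    return kept.concat(removed);
}

// State of </sun> after the second activity (authorization triples omitted).
var current = [
    { s: "/sun", p: "skos:prefLabel", o: "Sun" },
    { s: "/sun", p: "skos:definition", o: "The lamp of day" }
];
// Triple inserted by the second activity.
var added = [{ s: "/sun", p: "skos:definition", o: "The lamp of day" }];
// Reified triple referenced by audit:without.
var removed = [{ s: "/sun", p: "skos:definition", o: "The great luminary" }];

var previous = revert(current, added, removed);
```

Here previous holds the label "Sun" together with the original definition "The great luminary".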
Callimachus also allows users to upload RDF triple files (RDF/XML and Turtle). When an entire RDF triple file is uploaded, the metadata stored is slightly different. If the file data.rdf is uploaded to the home folder, all the triples in the file are inserted into the named graph </data.rdf>. In addition, the following named graph is created, and the binary file is stored permanently on disk, associated with the same activity identifier.
GRAPH </activity/2012/11/08/t3> {

</activity/2012/11/08/t3> a </callimachus/Activity>, audit:RecentBundle ;
    calli:reader </group/everyone>, </group/staff>, </group/admin> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
    prov:wasInfluencedBy </activity/2012/11/08/t1> ;
    prov:wasInfluencedBy </activity/2012/11/08/t2> .

</activity/2012/11/08/t3#provenance> a prov:Activity ;
    prov:startedAtTime "2012-11-08T15:36:40.039Z"^^xsd:dateTime ;
    prov:wasAssociatedWith </user/james> ;
    prov:generated </activity/2012/11/08/t3#!/data.rdf> ;
    prov:generated </activity/2012/11/08/t3#!/> ;
    prov:generated </activity/2012/11/08/t3#!/activity/2012/11/08/> ;
    prov:endedAtTime "2012-11-08T15:36:40.951Z"^^xsd:dateTime .

</activity/2012/11/08/t3#!/data.rdf>
    prov:specializationOf </data.rdf> .

</data.rdf> a </callimachus/NamedGraph>, sd:NamedGraph, foaf:Document ;
    calli:administrator </group/admin> ;
    calli:editor </group/staff> ;
    calli:reader </group/public> ;
    calli:subscriber </group/everyone> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
    dcterms:identifier "data" .

</activity/2012/11/08/t3#!/>
    prov:specializationOf </> ;
    prov:wasRevisionOf </activity/2012/11/08/t2#!/> .

</>
    calli:hasComponent </data.rdf> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> .

</activity/2012/11/08/t3#!/activity/2012/11/08/>
    prov:specializationOf </activity/2012/11/08/> ;
    prov:wasRevisionOf </activity/2012/11/08/t2#!/activity/2012/11/08/> .

</activity/2012/11/08/>
    calli:hasComponent </activity/2012/11/08/t3> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance>  .
}
All of this metadata is readily available in the history tab, or by following the rel=version-history link in the page, in an Atom feed, or in the Link header of an OPTIONS response. The metadata is formatted as an HTML list or as an Atom feed. Both of these representations include links to the PROV activity that modified the resource.
These named metadata activity graphs together provide a transparent audit trail of all the entities in the system, linked together by PROV relationships such as prov:wasInfluencedBy and prov:wasRevisionOf. This allows the software developers and the stakeholders to focus on their value-added features.
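To show how such a trail can be followed, the sketch below walks the prov:wasRevisionOf links of the activity folder entity, using the identifiers from the three graphs above. The in-memory map and the revisionChain function are illustrative assumptions; in practice the links would be read from the RDF store.

```javascript
// prov:wasRevisionOf links copied from the graphs above (newer -> older).
var wasRevisionOf = {
    "/activity/2012/11/08/t3#!/activity/2012/11/08/": "/activity/2012/11/08/t2#!/activity/2012/11/08/",
    "/activity/2012/11/08/t2#!/activity/2012/11/08/": "/activity/2012/11/08/t1#!/activity/2012/11/08/",
    "/activity/2012/11/08/t1#!/activity/2012/11/08/": "/activity/2012/11/08/t0#!/activity/2012/11/08/"
};

// Returns the versions of an entity from newest to oldest.
function revisionChain(entity, links) {
    var chain = [];
    var seen = {};
    // The seen map guards against accidental cycles in the data.
    while (entity && !seen.hasOwnProperty(entity)) {
        seen[entity] = true;
        chain.push(entity);
        entity = links.hasOwnProperty(entity) ? links[entity] : null;
    }
    return chain;
}

var history = revisionChain("/activity/2012/11/08/t3#!/activity/2012/11/08/", wasRevisionOf);
```

The resulting list starts at the t3 version and ends at the t0 version, one entry per activity that touched the folder.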
More information about Callimachus can be found at the project page at http://callimachusproject.org/.

Tuesday, June 5, 2012

Running less.js on the JVM Server

less.js is a CSS templating language that ships with a JavaScript file to convert templates into CSS.

The less.js distribution includes a Rhino patch to run less.js from the command line using Rhino. less.js no longer produces a Rhino version, but the patch remains available in the master branch.

less.js 1.3.0 uses ECMAScript 5 and will attempt to upgrade the Object and Array prototypes if run in a non-ECMAScript 5 environment. This prevents the script from running in many of the ECMAScript environments that ship with JDK 6.

There are two popular jars available that provide a Java API for less.js. Both of them use Rhino to run the script in the JVM.

Asual's has been around longer and hacks the Rhino patch to run as a library. Asual requires the latest version of Rhino.

lesscss-java claims to be the official Java version and includes envjs (which mimics a browser's script environment for running HTML apps offline). This allows the library to run less.js just as it would run in the browser. Envjs requires the latest version of Rhino.

If you try to run less.js using the ECMAScript engine in JDK 6, you may find that the core objects and prototypes are sealed and cannot be extended.

The version of ECMAScript on Mac JVMs seems to be only ECMA-3.1 or JavaScript 1.5. To run less.js there, you have to patch it to use utility functions instead of ECMAScript 5 functions. less.js also requires window and document objects to function. However, you can get away with the following environment.

        var window = {};
        var location = {port:0};
        var document = {
            getElementsByTagName: function(){return []},
            getElementById: function(){return null}
        };
        var require = function(arg) {
            return window.less[arg.split('/')[1]];
        };
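The utility-function patch mentioned above could look roughly like the following. This is a sketch under assumptions: the util object and its function names are hypothetical and not taken from less.js, and the helpers only approximate their ECMAScript 5 counterparts (for instance, they do not skip holes in sparse arrays). The point is that nothing is added to the sealed Object or Array prototypes.

```javascript
// Hypothetical stand-alone helpers that avoid touching built-in prototypes.
var util = {
    // Replacement for Array.isArray.
    isArray: function (value) {
        return Object.prototype.toString.call(value) === "[object Array]";
    },
    // Replacement for Array.prototype.forEach.
    forEach: function (array, fn) {
        for (var i = 0; i < array.length; i++) fn(array[i], i, array);
    },
    // Replacement for Array.prototype.map.
    map: function (array, fn) {
        var result = [];
        for (var i = 0; i < array.length; i++) result.push(fn(array[i], i, array));
        return result;
    },
    // Replacement for Object.keys.
    keys: function (object) {
        var result = [];
        for (var key in object) {
            if (Object.prototype.hasOwnProperty.call(object, key)) result.push(key);
        }
        return result;
    },
    // Replacement for String.prototype.trim.
    trim: function (string) {
        return String(string).replace(/^\s+|\s+$/g, "");
    }
};
```

A call such as list.map(fn) in the script would then be rewritten as util.map(list, fn).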

less.js uses XMLHttpRequest to import referenced documents. If you want to load other files yourself, it is best to override the window.less.Parser.importer function.

The function takes (path, paths, callback, env), where path is the import url, paths is an array (passed in from the constructor options), callback is a function to send the results, and env is the constructor options. The callback takes (e, root, content), where e is a thrown error, root is the parse tree and content is the file's contents (for error reporting). Here is a skeleton of the code you would need to run on jdk6.
        var contents = {};
        window.less.Parser.importer = function(path, paths, callback, env) {
            if (path != null) {
                var uri = new java.net.URI(paths[0]).resolve(path).normalize();
                var content = ...  // TODO read the uri content as a string
                var dir = uri.resolve(".").normalize();
                var file = dir.relativize(uri).toASCIIString();
                contents[file] = content;
                var parser = new window.less.Parser({
                    optimization: 3,
                    filename: file,
                    opaque: true,
                    paths: [dir.toASCIIString()]
                });
                parser.imports.contents = contents;
                parser.parse(content, function (e, root) {
                    if (e) throw e;
                    callback(e, root, content);
                });
            }
        };

To help debug less.js errors, the above has a fix for issue 592. Every new window.less.Parser has an imports.contents map, and this map needs to contain the basename of any imported file to resolve error locations. If the map does not contain the basename, a charAt error is thrown.
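For reference, the directory and basename that the importer derives with java.net.URI can be approximated in plain JavaScript as shown below. This is only a sketch: splitUrl and remember are hypothetical names, and the string handling assumes an already normalized absolute URL without a query string or fragment.

```javascript
// Splits a URL into its directory and basename, roughly what
// uri.resolve(".") and dir.relativize(uri) compute in the importer above.
function splitUrl(url) {
    var slash = url.lastIndexOf("/");
    return {
        dir: url.substring(0, slash + 1),
        file: url.substring(slash + 1)
    };
}

// Records the source text under its basename, the key that the
// imports.contents map is expected to hold.
function remember(contents, url, content) {
    contents[splitUrl(url).file] = content;
    return contents;
}
```

For http://example.com/styles/theme.less the directory is http://example.com/styles/ and the key stored in the map is theme.less.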

If running server side you may also be interested in this patch to inline both less and CSS files. The opaque flag above turns this on.

Sunday, January 15, 2012

Blob Store

In release 2.0-beta14 (I know, this is a late beta release) AliBaba introduced a new BLOB store. The BLOB store integrates with the RDF repository ObjectRepository to synchronize transactions. This allows both the BLOB store and the RDF store to be isolated and always consistent with one another. This is done using two-phase commit transactions in the BLOB store.

The BLOB store also has a few other advantages over a traditional file system. First, every change is isolated until it is closed/committed. This prevents other readers from seeing an incomplete BLOB and helps prevent inconsistency between the BLOB and RDF stores. In addition, as disk space is generally considered cheap, all past versions of BLOBs are kept on disk by default. This allows any previous version to be retrieved (and restored) using the API.

The BLOB store API is fairly simple. Here is what some code might look like using the BLOB store.

BlobStoreFactory factory = BlobStoreFactory.newInstance();
BlobStore store = factory.openBlobStore(new File("."));
String key = "http://example.com/store1/key1";
BlobObject blob = store.open(key);
OutputStream out = blob.openOutputStream();
try {
    // write stream to out
} finally {
    out.close();
}
InputStream in = blob.openInputStream();
try {
    // read stream from in
} finally {
    in.close();
}

More API options can be seen in the JavaDocs: