Wednesday, October 3, 2012

Provenance and Traceability in RDF with Callimachus

All too often software is designed without adequate regard for traceability. Traceability refers to the ability to audit the state of data at any point in the system for correctness and completeness: for any entity in the system, all transactions that lead to the current state, along with their metadata, can be examined, reviewed, and verified. Software is supposed to be designed according to the stakeholders' requirements, but many stakeholders take traceability for granted. Most people don't audit most of the time, but the ability to audit at all requires traceability all the time.
Consider the common scenario where a business is trying to semi-automate a business process. Often businesses are trying to move from an informal email-based process to a web-based semi-automated process. Such a move can reduce human involvement and make the process faster and more efficient, leading to greater productivity. However, few participants realize the inherent traceability of email-based processes. Moving from email-based to web-based, without proper consideration, can kill a company's ability to audit the process for correctness and completeness.
Today most web-based systems are built using SQL databases. However, the rigid nature of SQL-based systems creates a significant barrier for adding traceability to an existing SQL-based system. Traceability is not an add-on feature; it requires deep integration into every change and every transaction. This is something many SQL-based systems cannot easily provide.
Papers on digital traceability date as far back as 1986. However, a quarter of a century later, there are still no standards for tracking digital conceptual objects (as there are in many other industries for the traceability of physical objects). Furthermore, following the explosion of digital data in the past decade and the increased reliance on information from the Web, there is a growing problem: no one seems to know whether the information being collected is accurate.
This may change in 2013: the W3C has been working on a general provenance information standard since 2009, and it is scheduled for release next year. The goal is to support the widespread publication and use of provenance information for Web documents, data, and resources. Specifically, the W3C is defining a provenance interchange language and methods to publish and access provenance metadata using this language.
The PROV specification (currently in last call) defines things as entities, activities, and agents. Entities are physical, digital, conceptual, or any other kinds of thing. Examples of such entities are a web page, a chart, and a spellchecker. Activities are how entities come into existence and how their attributes change. An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or any other entity that may be ascribed responsibility.
Callimachus 0.18 will be the first Callimachus release to use this new PROV language to seamlessly describe all the activities that take place in the system. The Callimachus project was named after the man who created the first library catalogue, so it should not be too surprising that the project continues this legacy by creating metadata about every activity performed.
When a new resource is created, metadata is stored in the triple store to record the event. Each such activity is stored in the RDF store in a named graph. For example, suppose a create form submits the following triples:
</sun> a </callimachus/Concept> , skos:Concept ;
    skos:prefLabel "Sun" ;
    skos:definition "The great luminary" .
Additional authorization information is copied from the class and the parent folder, including:
</sun> calli:reader </group/public> ;
    calli:subscriber </group/everyone> ;
    calli:editor </group/staff> ;
    calli:administrator </group/admin> .
Callimachus uses this information as a simple authorization model that is similar to the ACL of a file system. Here the groups or users of the system are assigned authorization rights to the resource. calli:reader provides read-only access; calli:subscriber provides access to the resource's history and provenance data and grants the ability to discuss or comment on the resource; calli:editor provides the ability to change the resource; and calli:administrator provides the ability to change the authorization information.
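The role properties can be read as a small lookup table. The Python sketch below is only an illustration of that reading: the right names ("read", "edit", and so on) and the allowed helper are made up for this example and are not Callimachus code. It also assumes each role grants only the rights listed above; whether roles are cumulative is not stated here.

```python
# Hypothetical mapping from calli: role properties to the rights described above.
ROLE_RIGHTS = {
    "calli:reader": {"read"},
    "calli:subscriber": {"view-history", "comment"},
    "calli:editor": {"edit"},
    "calli:administrator": {"change-authorization"},
}

# The authorization triples of </sun>, keyed by role property.
SUN_ACL = {
    "calli:reader": {"/group/public"},
    "calli:subscriber": {"/group/everyone"},
    "calli:editor": {"/group/staff"},
    "calli:administrator": {"/group/admin"},
}


def allowed(groups, acl, right):
    """Return True if any of the caller's groups holds a role granting `right`."""
    for role, members in acl.items():
        if right in ROLE_RIGHTS.get(role, set()) and members & set(groups):
            return True
    return False
```

With this table, a member of /group/staff may edit </sun>, while a member of only /group/public may read it but not edit it.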
The resource is also inserted into the parent folder using the following triple:
</> calli:hasComponent </sun> .
Callimachus provides a hierarchical view of resources that mimics the path segments of their identifier. This hierarchical relationship is captured using the inverse-functional calli:hasComponent property from the parent resource to its child. The reason this is an inverse-functional relationship is to require proper authorization to change the parent resource when adding a new child resource.
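To make the path-segment idea concrete, here is a minimal Python sketch that derives the parent folder from a resource path and lists the calli:hasComponent links up to the root. The function names are hypothetical, and the assumption that every intermediate folder is linked the same way is mine, drawn only from the description above.

```python
def parent_folder(path):
    """Return the folder path that would hold `path`, or None for the root."""
    if path == "/":
        return None
    trimmed = path.rstrip("/")
    # Keep everything up to and including the last slash.
    return trimmed[: trimmed.rfind("/") + 1]


def has_component_triples(path):
    """List (parent, 'calli:hasComponent', child) from `path` up to the root."""
    triples = []
    child = path
    parent = parent_folder(child)
    while parent is not None:
        triples.append((parent, "calli:hasComponent", child))
        child, parent = parent, parent_folder(parent)
    return triples
```

Because each path has exactly one parent folder, a child can appear as the object of at most one such triple, which is what an inverse-functional property expresses.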
Finally, all these triples are combined and stored in the RDF store in an activity graph, along with the PROV metadata of the activity itself. prov:wasGeneratedBy is a functional property that links each resource entity to the last activity that modified it. The path prov:generated/prov:specializationOf links the activity to the resource entities it modified.
GRAPH </activity/2012/11/08/t1> {

    </activity/2012/11/08/t1> a </callimachus/Activity>, audit:RecentBundle ;
        calli:reader </group/everyone>, </group/staff>, </group/admin> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
        prov:wasInfluencedBy </activity/2012/11/08/t0> .

    </activity/2012/11/08/t1#provenance> a prov:Activity ;
        prov:startedAtTime "2012-11-08T15:07:22.869Z"^^xsd:dateTime ;
        prov:wasAssociatedWith </user/james> ;
        prov:generated </activity/2012/11/08/t1#!/sun> ;
        prov:generated </activity/2012/11/08/t1#!/> ;
        prov:generated </activity/2012/11/08/t1#!/activity/2012/11/08/> ;
        prov:endedAtTime "2012-11-08T15:07:24.583Z"^^xsd:dateTime .

    </activity/2012/11/08/t1#!/sun>
        prov:specializationOf </sun> .
    </sun> a </callimachus/Concept>, skos:Concept ;
        calli:administrator </group/admin> ;
        calli:editor </group/staff> ;
        calli:reader </group/public> ;
        calli:subscriber </group/everyone> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
        skos:definition "The great luminary" ;
        skos:prefLabel "Sun" .

    </activity/2012/11/08/t1#!/>
        prov:specializationOf </> ;
        prov:wasRevisionOf </activity/2012/11/08/t0#!/> .
    </>
        calli:hasComponent </sun> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .

    </activity/2012/11/08/t1#!/activity/2012/11/08/>
        prov:specializationOf </activity/2012/11/08/> ;
        prov:wasRevisionOf </activity/2012/11/08/t0#!/activity/2012/11/08/> .
    </activity/2012/11/08/>
        calli:hasComponent </activity/2012/11/08/t1> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
}
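One way to read this graph is to start from a resource, follow prov:wasGeneratedBy to its latest activity, and then walk prov:generated/prov:specializationOf back to the resources that activity touched. The Python sketch below does this over a handful of triples copied from the listing. It is not how Callimachus queries its store; the tuple representation and the function names are assumptions made for the example.

```python
T1 = "/activity/2012/11/08/t1"

# A few triples copied from the activity graph, as (subject, predicate, object).
TRIPLES = [
    ("/sun", "prov:wasGeneratedBy", T1 + "#provenance"),
    (T1 + "#provenance", "prov:wasAssociatedWith", "/user/james"),
    (T1 + "#provenance", "prov:endedAtTime", "2012-11-08T15:07:24.583Z"),
    (T1 + "#provenance", "prov:generated", T1 + "#!/sun"),
    (T1 + "#!/sun", "prov:specializationOf", "/sun"),
]


def objects(triples, subject, predicate):
    """Return every object for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def last_modification(triples, resource):
    """Follow prov:wasGeneratedBy from a resource to its latest activity."""
    activities = objects(triples, resource, "prov:wasGeneratedBy")
    if len(activities) != 1:
        # The property is functional, so exactly one value is expected.
        raise ValueError("expected one prov:wasGeneratedBy for " + resource)
    activity = activities[0]
    # Walk prov:generated/prov:specializationOf back to the resources touched.
    touched = [
        target
        for revision in objects(triples, activity, "prov:generated")
        for target in objects(triples, revision, "prov:specializationOf")
    ]
    return {
        "activity": activity,
        "agents": objects(triples, activity, "prov:wasAssociatedWith"),
        "touched": touched,
    }
```

Calling last_modification(TRIPLES, "/sun") returns the t1 provenance activity, the agent </user/james>, and </sun> as the touched resource.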
Modifying a resource is a bit trickier, as Callimachus stores both the previous version and the new version of the resource. Suppose a client sends the following update to the server:
DELETE DATA {
    </sun> skos:definition "The great luminary" .
};
INSERT DATA {
    </sun> skos:definition "The lamp of day" .
};
Three triples are removed (not just one) from all graphs in the RDF store.
DELETE DATA {
    </sun> skos:definition "The great luminary" ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .

    </activity/2012/11/08/> prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
};
The removed skos:definition triple is then replaced in the first activity's graph with the following, to keep the semantics of the first activity intact.
INSERT DATA {
    GRAPH </activity/2012/11/08/t1> {
        </activity/2012/11/08/t1#!/sun> audit:with </activity/2012/11/08/t2#5eef4c8f> .
    } 
} 
In addition, a new named graph is created with the following, to represent this new activity.
GRAPH </activity/2012/11/08/t2> {

    </activity/2012/11/08/t2> a </callimachus/Activity> , audit:RecentBundle ;
        calli:reader </group/everyone>, </group/staff>, </group/admin> ;
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
        prov:wasInfluencedBy </activity/2012/11/08/t1> .

    </activity/2012/11/08/t2#provenance> a prov:Activity ;
        prov:startedAtTime "2012-11-08T15:19:31.199Z"^^xsd:dateTime ;
        prov:wasAssociatedWith </user/james> ;
        prov:generated </activity/2012/11/08/t2#!/sun> ;
        prov:generated </activity/2012/11/08/t2#!/activity/2012/11/08/> ;
        prov:endedAtTime "2012-11-08T15:19:31.295Z"^^xsd:dateTime .

    </activity/2012/11/08/t2#!/sun>
        audit:without </activity/2012/11/08/t2#5eef4c8f> ;
        prov:specializationOf </sun> ;
        prov:wasRevisionOf </activity/2012/11/08/t1#!/sun> .
    </activity/2012/11/08/t2#5eef4c8f>
        rdf:object "The great luminary" ;
        rdf:predicate skos:definition ;
        rdf:subject </sun> .
    </sun>
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
        skos:definition "The lamp of day" .

    </activity/2012/11/08/t2#!/activity/2012/11/08/>
        prov:specializationOf </activity/2012/11/08/> ;
        prov:wasRevisionOf </activity/2012/11/08/t1#!/activity/2012/11/08/> .
    </activity/2012/11/08/>
        calli:hasComponent </activity/2012/11/08/t2> ;
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> .
}
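The reified triple is what lets an auditor see the earlier state. The Python sketch below shows one reading of the listings: the earlier revision is whatever remains about the resource in the first activity's graph, plus every removed triple that the revision is recorded audit:with. This interpretation of audit:with, the abridged triple lists, and the function names are all assumptions for illustration, not Callimachus internals.

```python
T1 = "/activity/2012/11/08/t1"
T2 = "/activity/2012/11/08/t2"

# Abridged contents of graph t1 after the update (the old definition is gone).
GRAPH_T1 = [
    ("/sun", "rdf:type", "skos:Concept"),
    ("/sun", "skos:prefLabel", "Sun"),
    (T1 + "#!/sun", "prov:specializationOf", "/sun"),
    (T1 + "#!/sun", "audit:with", T2 + "#5eef4c8f"),
]

# Abridged contents of graph t2: the reified old triple and the new value.
GRAPH_T2 = [
    (T2 + "#5eef4c8f", "rdf:subject", "/sun"),
    (T2 + "#5eef4c8f", "rdf:predicate", "skos:definition"),
    (T2 + "#5eef4c8f", "rdf:object", "The great luminary"),
    ("/sun", "skos:definition", "The lamp of day"),
]


def dereify(store, node):
    """Turn an rdf:subject/rdf:predicate/rdf:object description into a triple."""
    parts = {p: o for s, p, o in store if s == node}
    return (parts["rdf:subject"], parts["rdf:predicate"], parts["rdf:object"])


def restore_revision(graph, revision, store):
    """Rebuild the resource's triples as they stood at `revision`."""
    resource = next(
        o for s, p, o in graph
        if s == revision and p == "prov:specializationOf"
    )
    kept = [t for t in graph if t[0] == resource]
    removed = [
        dereify(store, o) for s, p, o in graph
        if s == revision and p == "audit:with"
    ]
    return kept + removed
```

Restoring the t1 revision of </sun> this way yields the "The great luminary" definition again, while "The lamp of day" stays confined to the t2 graph.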
Callimachus also allows users to upload RDF triple files (RDF/XML and Turtle). When an entire RDF triple file is uploaded, the metadata stored is slightly different. If the file data.rdf is uploaded to the home folder, all the triples in the file are inserted into the named graph </data.rdf>. In addition, the following named graph is created, and the binary file is stored permanently on disk, associated with the same activity identifier.
GRAPH </activity/2012/11/08/t3> {

</activity/2012/11/08/t3> a </callimachus/Activity>, audit:RecentBundle ;
    calli:reader </group/everyone>, </group/staff>, </group/admin> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
    prov:wasInfluencedBy </activity/2012/11/08/t1> ;
    prov:wasInfluencedBy </activity/2012/11/08/t2> .

</activity/2012/11/08/t3#provenance> a prov:Activity ;
    prov:startedAtTime "2012-11-08T15:36:40.039Z"^^xsd:dateTime ;
    prov:wasAssociatedWith </user/james> ;
    prov:generated </activity/2012/11/08/t3#!/data.rdf> ;
    prov:generated </activity/2012/11/08/t3#!/> ;
    prov:generated </activity/2012/11/08/t3#!/activity/2012/11/08/> ;
    prov:endedAtTime "2012-11-08T15:36:40.951Z"^^xsd:dateTime .

</activity/2012/11/08/t3#!/data.rdf>
    prov:specializationOf </data.rdf> .

</data.rdf> a </callimachus/NamedGraph>, sd:NamedGraph, foaf:Document ;
    calli:administrator </group/admin> ;
    calli:editor </group/staff> ;
    calli:reader </group/public> ;
    calli:subscriber </group/everyone> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
    dcterms:identifier "data" .

</activity/2012/11/08/t3#!/>
    prov:specializationOf </> ;
    prov:wasRevisionOf </activity/2012/11/08/t2#!/> .

</>
    calli:hasComponent </data.rdf> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> .

</activity/2012/11/08/t3#!/activity/2012/11/08/>
    prov:specializationOf </activity/2012/11/08/> ;
    prov:wasRevisionOf </activity/2012/11/08/t2#!/activity/2012/11/08/> .

</activity/2012/11/08/>
    calli:hasComponent </activity/2012/11/08/t3> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance>  .
}
All of this metadata is readily available in the history tab, or by following the rel=version-history link found in the page, in an Atom feed, or in the Link header of an OPTIONS response. The metadata is formatted as an HTML list or as an Atom feed. Both of these representations include links to the PROV activity that modified the resource.
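A client that wants the history programmatically could pull the version-history target out of the Link header. The following Python sketch uses a small regular-expression parser that handles only simple header values. The example header, including the ?history and ?describe targets, is invented for illustration and is not taken from a real Callimachus response.

```python
import re

# Matches one link-value: <target> followed by zero or more ;name=value params.
LINK_PATTERN = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)'
)
# Matches a single ;name=value parameter, quoted or bare.
PARAM_PATTERN = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')

# Hypothetical header value; the targets are assumptions for this example.
EXAMPLE_HEADER = (
    '</sun?history>; rel="version-history", '
    '</sun?describe>; rel=describedby'
)


def find_link(header, rel):
    """Return the first link target whose rel parameter contains `rel`."""
    for target, params in LINK_PATTERN.findall(header):
        for name, quoted, bare in PARAM_PATTERN.findall(params):
            if name.lower() == "rel" and rel in (quoted or bare).split():
                return target
    return None
```

Here find_link(EXAMPLE_HEADER, "version-history") returns the first target, which the client can then request as an Atom feed or HTML list.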
These named metadata activity graphs together provide an audit trail of all the entities in the system in a transparent way, linked together with the common prov:used relationships. This allows the software developers and the stakeholders to focus on their value-added features.
More information about Callimachus can be found at the project page at http://callimachusproject.org/.

