Tuesday, March 12, 2013

Actor Model: Multi-threaded Parallel Processing in Java

Intent

The Actor model provides a constrained way to use multi-threaded parallel processing. Each Actor is used to process queued requests (or "messages") one at a time, as one stage of a pipeline. Below is one way to implement the Actor model in Java.

Motivation

When multiple tasks need to be performed in a pipeline, there is sometimes a desire to execute them concurrently using separate threads, for example to take advantage of hardware with multiple CPU cores. But concurrent programming is notoriously prone to bugs that are difficult to replicate and isolate, so it is helpful to use a programming model that imposes some structure on how separate threads are used and how they can interact.

In the Actor model, each actor runs in its own thread and only operates locally on its own queue of tasks. Multiple actors can be set up in a pipeline to work in parallel, each actor consuming the tasks in its own queue, and potentially adding tasks to the queues of other actors. For example, one actor may save RDF graphs from an HTTP endpoint, while another actor downstream later performs a computation on those graphs.

Another reason for using an actor that runs in its own thread, processing one task at a time, is to impose throttling, so that too many threads are not trying to run at once. Throttling is not only helpful in preventing one client from consuming inordinate resources; in many cases it can actually improve total throughput by preventing resource contention.

Implementation

Each actor is represented by a separate Java class and has its own queue of similar tasks (or messages) that it will process, one at a time. Each task is represented as an instance of the actor class. A task's constructors create an instance that can be sent to the actor by calling the "submit" method on that instance, thus queuing the task for processing. The constructors take required parameters as arguments; optional parameters may be set via setter methods.

Each actor (a Java class) has its own thread, which it uses to asynchronously process the tasks in its queue. Different actors process different types of tasks in different queues and on different threads. Each task class holds its actor's thread in a static field called "actor". The actor is an instance of the standard ExecutorService interface provided by the java.util.concurrent package. The queue is created and maintained inside the ExecutorService, so custom code does not need to deal with queue maintenance, thus (hopefully) reducing the opportunity for thread programming errors.

Each task class must implement a "call" method (declared by the Callable interface), which will be invoked on the actor's thread when it is time to process one instance (or message).

Sample Code

We'll sketch out how to define a task class called GraphReaderTask, which will read RDF graphs from a set of URLs and store those graphs into an RDF repository using Sesame. First, we'll need to import the standard Java concurrency classes (the class below also uses java.io, java.net, java.lang.reflect, and the Sesame org.openrdf classes, which are imported the same way):
import java.util.concurrent.*;
Our GraphReaderTask class (or a nested class) must implement the Callable interface:
public class GraphReaderTask implements Callable<Void> {
Here is the static "actor" that holds the thread for the GraphReaderTask:
private static final ExecutorService actor = Executors.newSingleThreadExecutor();
Next, some fields that the GraphReaderTask will need in processing each message:
private final Repository repository; // A Sesame RDF repository
private final String url; // The URL of an RDF graph to save
private Future<Void> ctrl; // Tracks the task once it has been submitted to the actor
Now we can define a GraphReaderTask constructor and control methods:
public GraphReaderTask(Repository repository, String url) {
    this.repository = repository;
    this.url = url;
}
public boolean isSubmitted() {
    return ctrl != null;
}

public boolean isCancelled() {
    return ctrl != null && ctrl.isCancelled();
}

public boolean isDone() {
    return ctrl != null && ctrl.isDone();
}

public synchronized void submit() {
    if (ctrl == null) {
        ctrl = actor.submit(this); // queue this task on the actor's thread
    } else {
        throw new IllegalStateException();
    }
}

public boolean cancel() {
    return ctrl != null && ctrl.cancel(false);
}
public void await() throws InterruptedException, IOException,
        OpenRDFException {
    try {
        ctrl.get();
    } catch (ExecutionException e) {
        // Unwrap the cause and rethrow it as its original (declared) type
        try {
            throw e.getCause();
        } catch (Error cause) {
            throw cause;
        } catch (RuntimeException cause) {
            throw cause;
        } catch (IOException cause) {
            throw cause;
        } catch (OpenRDFException cause) {
            throw cause;
        } catch (Throwable cause) {
            throw new UndeclaredThrowableException(cause);
        }
    }
}
Now we can define a "call" method, which will be invoked by the actor when it is time to process the next task from this actor's queue, and must perform the guts of whatever this actor/task should do. In this example, the GraphReaderTask simply reads an RDF graph from a URL and stores it into our Sesame repository.

Remember that all instances of GraphReaderTask are associated with one actor, which is an ExecutorService that provides features for shutting down gracefully. So before we actually start doing any work, we first check whether the task has been cancelled, and if so, merely return without doing anything (except perhaps writing a note to a log).
public Void call() throws IOException, OpenRDFException {
    if (isCancelled())
        return null;
    URLConnection http = new URL(url).openConnection();
    http.setRequestProperty("Accept", "application/rdf+xml");
    InputStream in = http.getInputStream();

    RepositoryConnection con = repository.getConnection();
    con.setAutoCommit(false); // begin a transaction
    try {
        ValueFactory vf = con.getValueFactory();
        URI graph = vf.createURI(url);

        con.clear(graph); // drop any previous copy of this graph
        if (isCancelled())
            return null;
        con.add(in, url, RDFFormat.RDFXML, graph);

        con.setAutoCommit(true); // commit the transaction
    } finally {
        con.rollback(); // rolls back anything left uncommitted
        con.close();
        in.close();
    }
    return null;
}
Now that we have defined the GraphReaderTask class, we need to make use of it.
new GraphReaderTask(repository, url).submit();
This technique allows the caller thread to continue with other processing, while the dedicated graph reader thread takes care of parsing RDF. By using a queue we ensure that threads are not blocked when they could be performing other operations. The await() method can be used by the caller to rejoin when the task is complete and to propagate any exceptions that may have occurred.
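For example, a caller might queue several graphs and only rejoin once they have all been processed. The following is a minimal sketch, assuming a urls collection and a repository are already available and that the enclosing method declares the exceptions thrown by await():
List<GraphReaderTask> tasks = new ArrayList<GraphReaderTask>();
for (String url : urls) {
    GraphReaderTask task = new GraphReaderTask(repository, url);
    task.submit(); // queue the task on the actor's single thread
    tasks.add(task);
}
// ... the caller is free to do other work here ...
for (GraphReaderTask task : tasks) {
    task.await(); // rejoin and propagate any exception from that task
}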

Wednesday, October 3, 2012

Provenance and Traceability in RDF with Callimachus

All too often software is designed without adequate regard for traceability. Traceability refers to the ability to audit the state of data at any point in the system for correctness and completeness: for any entity in the system, all transactions that lead to its current state, along with their metadata, can be examined, reviewed, and verified. Software is supposed to be designed according to the stakeholders' requirements, but many of those stakeholders take traceability for granted. Most people don't audit most of the time, but the ability to audit at all requires traceability all the time.
Consider the common scenario where a business is trying to provide some semi-automation to a business process. Often businesses are trying to move from an informal email-based process to a web-based semi-automated process. Such a move can reduce human involvement and make the process faster and more efficient, leading to greater productivity. However, few participants realize the inherent traceability of email-based processes. Moving from an email-based process to a web-based one, without proper consideration, can kill a company's ability to audit the process for correctness and completeness.
Today most web-based systems are built using SQL databases. However, the rigid nature of SQL-based systems creates a significant barrier for adding traceability to an existing SQL-based system. Traceability is not an add-on feature; it requires deep integration into every change and every transaction. This is something many SQL-based systems cannot easily provide.
Papers on digital traceability date as far back as 1986. However, a quarter of a century later, there are still no standards for tracking digital conceptual objects (as there are in many other industries for the traceability of physical objects). Furthermore, following the explosion of digital data in the past decade and the increased reliance on information from the Web, there is a growing problem: no one seems to know whether any of the information being collected is accurate.
This may change in 2013: the W3C has been working on a general provenance standard since 2009 that is scheduled for release next year. It is intended to support the widespread publication and use of provenance information for Web documents, data, and resources. Specifically, the W3C is defining a provenance interchange language and methods to publish and access provenance metadata using this language.
The PROV specification (currently in Last Call) describes things in terms of entities, activities, and agents. Entities are physical, digital, conceptual, or any other kind of thing; examples include a web page, a chart, and a spellchecker. Activities are how entities come into existence and how their attributes change. An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software, an inanimate object, an organization, or any other entity that may be ascribed responsibility.
Callimachus 0.18 will be the first Callimachus release to use this new PROV language to seamlessly describe all the activities that take place in the system. The Callimachus project was named after the man who created the first library catalogue, so it should not be too surprising that the project continues this legacy by creating metadata about every activity performed.
When a new resource is created, metadata is stored in the triple store to record the event. These activities are stored in the RDF store in named graphs. For example, suppose a create form submits the following triples:
</sun> a </callimachus/Concept> , skos:Concept ;
    skos:prefLabel "Sun" ;
    skos:definition "The great luminary" .
Additional authorization information, copied from the class and parent folder, includes:
</sun> calli:reader </group/public> ;
calli:subscriber </group/everyone> ;
calli:editor </group/staff> ;
calli:administrator </group/admin> .
Callimachus uses this authorization information as a simple authorization model similar to the ACL of a file system. Here the groups or users of the system are assigned authorization rights to the resource: calli:reader provides read-only access; calli:subscriber provides access to the resource's history and provenance data and grants the ability to discuss or comment on the resource; calli:editor provides the ability to change the resource; and calli:administrator provides the ability to change the authorization information.
The resource is also inserted into the parent folder using the following triple:
</> calli:hasComponent </sun> .
Callimachus provides a hierarchical view of resources that mimics the path segments of their identifier. This hierarchical relationship is captured using the inverse-functional calli:hasComponent property from the parent resource to its child. The reason this is an inverse-functional relationship is to require proper authorization to change the parent resource when adding a new child resource.
Finally, all these triples are combined and stored in the RDF store in an activity graph, along with the PROV metadata of the activity itself. prov:wasGeneratedBy is a functional property that links each resource entity to the last activity that modified it. The prov:generated/prov:specializationOf path links the activity to the resource entities it modified.
GRAPH </activity/2012/11/08/t1> {

    </activity/2012/11/08/t1> a </callimachus/Activity>, audit:RecentBundle ;
        calli:reader </group/everyone>, </group/staff>, </group/admin> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
        prov:wasInfluencedBy </activity/2012/11/08/t0> .

    </activity/2012/11/08/t1#provenance> a prov:Activity ;
        prov:startedAtTime "2012-11-08T15:07:22.869Z"^^xsd:dateTime ;
        prov:wasAssociatedWith </user/james> ;
        prov:generated </activity/2012/11/08/t1#!/sun> ;
        prov:generated </activity/2012/11/08/t1#!/> ;
        prov:generated </activity/2012/11/08/t1#!/activity/2012/11/08/> ;
        prov:endedAtTime "2012-11-08T15:07:24.583Z"^^xsd:dateTime .

    </activity/2012/11/08/t1#!/sun>
        prov:specializationOf </sun> .
    </sun> a </callimachus/Concept>, skos:Concept ;
        calli:administrator </group/admin> ;
        calli:editor </group/staff> ;
        calli:reader </group/public> ;
        calli:subscriber </group/everyone> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> ;
        skos:definition "The great luminary" ;
        skos:prefLabel "Sun" .

    </activity/2012/11/08/t1#!/>
        prov:specializationOf </> ;
        prov:wasRevisionOf </activity/2012/11/08/t0#!/> .
    </>
        calli:hasComponent </sun> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .

    </activity/2012/11/08/t1#!/activity/2012/11/08/>
        prov:specializationOf </activity/2012/11/08/> ;
        prov:wasRevisionOf </activity/2012/11/08/t0#!/activity/2012/11/08/> .
    </activity/2012/11/08/>
        calli:hasComponent </activity/2012/11/08/t1> ;
        prov:wasGeneratedBy </activity/2012/11/08/t1#provenance> .
}
Modifying a resource is a bit trickier, as Callimachus stores both the previous version and the new version of the resource. If the client sends the following update to the server:
DELETE DATA {
    </sun> skos:definition "The great luminary" .
};
INSERT DATA {
    </sun> skos:definition "The lamp of day" .
};
Three triples are removed (not just one) from all graphs in the RDF store.
DELETE DATA {
    </sun> skos:definition "The great luminary" ;
        prov:wasGeneratedBy </activity/2012/10/02/t1> .

    </activity/2012/10/02/> prov:wasGeneratedBy </activity/2012/10/02/t1> .
};
The removed triple is then tracked with the following insert, to keep the semantics of the first activity intact.
INSERT DATA {
    GRAPH </activity/2012/11/08/t1> {
        </activity/2012/11/08/t1#!/sun> audit:with </activity/2012/11/08/t2#5eef4c8f> .
    } 
} 
In addition, a new named graph is created with the following, to represent this new activity.
GRAPH </activity/2012/11/08/t2> {

    </activity/2012/11/08/t2> a </callimachus/Activity> , audit:RecentBundle ;
        calli:reader </group/everyone>, </group/staff>,  </group/admin>;
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
        prov:wasInfluencedBy </activity/2012/11/08/t1> .

    </activity/2012/11/08/t2#provenance> a prov:Activity ;
        prov:startedAtTime "2012-11-08T15:19:31.199Z"^^xsd:dateTime ;
        prov:wasAssociatedWith </user/james> ;
        prov:generated </activity/2012/11/08/t2#!/sun> ;
        prov:generated </activity/2012/11/08/t2#!/activity/2012/11/08/> ;
        prov:endedAtTime "2012-11-08T15:19:31.295Z"^^xsd:dateTime .

    </activity/2012/11/08/t2#!/sun>
        audit:without </activity/2012/11/08/t2#5eef4c8f> ;
        prov:specializationOf </sun> ;
        prov:wasRevisionOf </activity/2012/11/08/t1#!/sun> .
    </activity/2012/11/08/t2#5eef4c8f>
        rdf:object "The great luminary" ;
        rdf:predicate skos:definition ;
        rdf:subject </sun> .
    </sun>
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> ;
        skos:definition "The lamp of day" .

    </activity/2012/11/08/t2#!/activity/2012/11/08/>
        prov:specializationOf </activity/2012/11/08/> ;
        prov:wasRevisionOf </activity/2012/11/08/t1#!/activity/2012/11/08/> .
    </activity/2012/11/08/>
        calli:hasComponent </activity/2012/11/08/t2> ;
        prov:wasGeneratedBy </activity/2012/11/08/t2#provenance> .
}
Callimachus also allows users to upload RDF files (RDF/XML and Turtle). When an entire RDF file is uploaded, the metadata stored is slightly different. If the file data.rdf is uploaded to the home folder, all the triples in the file are inserted into the named graph </data.rdf>. In addition, the following named graph is created, and the binary file is stored permanently on disk, associated with the same activity identifier.
GRAPH </activity/2012/11/08/t3> {

</activity/2012/11/08/t3> a </callimachus/Activity>, audit:RecentBundle ;
    calli:reader </group/everyone>, </group/staff>, </group/admin> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
    prov:wasInfluencedBy </activity/2012/11/08/t1> ;
    prov:wasInfluencedBy </activity/2012/11/08/t2> .

</activity/2012/11/08/t3#provenance> a prov:Activity ;
    prov:startedAtTime "2012-11-08T15:36:40.039Z"^^xsd:dateTime ;
    prov:wasAssociatedWith </user/james> ;
    prov:generated </activity/2012/11/08/t3#!/data.rdf> ;
    prov:generated </activity/2012/11/08/t3#!/> ;
    prov:generated </activity/2012/11/08/t3#!/activity/2012/11/08/> ;
    prov:endedAtTime "2012-11-08T15:36:40.951Z"^^xsd:dateTime .

</activity/2012/11/08/t3#!/data.rdf>
    prov:specializationOf </data.rdf> .

</data.rdf> a </callimachus/NamedGraph>, sd:NamedGraph, foaf:Document ;
    calli:administrator </group/admin> ;
    calli:editor </group/staff> ;
    calli:reader </group/public> ;
    calli:subscriber </group/everyone> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> ;
    dcterms:identifier "data" .

</activity/2012/11/08/t3#!/>
    prov:specializationOf </> ;
    prov:wasRevisionOf </activity/2012/11/08/t2#!/> .

</>
    calli:hasComponent </data.rdf> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance> .

</activity/2012/11/08/t3#!/activity/2012/11/08/>
    prov:specializationOf </activity/2012/11/08/> ;
    prov:wasRevisionOf </activity/2012/11/08/t2#!/activity/2012/11/08/> .

</activity/2012/11/08/>
    calli:hasComponent </activity/2012/11/08/t3> ;
    prov:wasGeneratedBy </activity/2012/11/08/t3#provenance>  .
}
All of this metadata is readily available in the history tab, or by following the rel=version-history link in the page, in the Atom feed, or in the Link header of an OPTIONS response. The metadata is formatted as an HTML list or as an Atom feed. Both of these representations include links to the PROV activity that modified the resource.
These named metadata activity graphs together provide an audit trail of all the entities in the system in a transparent way, linked together with common PROV relationships. This allows the software developers and the stakeholders to focus on their value-added features.
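As a rough illustration of how this audit trail might be consulted programmatically, the sketch below uses plain Sesame (org.openrdf.repository and org.openrdf.query classes, not any Callimachus-specific API) to ask which activity last generated a resource, who performed it, and when. The resource IRI http://example.com/sun and the repository variable are hypothetical:
RepositoryConnection con = repository.getConnection();
try {
    TupleQuery query = con.prepareTupleQuery(QueryLanguage.SPARQL,
        "PREFIX prov: <http://www.w3.org/ns/prov#>\n" +
        "SELECT ?activity ?agent ?ended WHERE {\n" +
        "  <http://example.com/sun> prov:wasGeneratedBy ?activity .\n" +
        "  ?activity prov:wasAssociatedWith ?agent ;\n" +
        "            prov:endedAtTime ?ended .\n" +
        "}");
    TupleQueryResult result = query.evaluate();
    try {
        while (result.hasNext()) {
            BindingSet row = result.next();
            System.out.println(row.getValue("activity") + " by "
                    + row.getValue("agent") + " at " + row.getValue("ended"));
        }
    } finally {
        result.close();
    }
} finally {
    con.close();
}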
More information about Callimachus can be found at the project page at http://callimachusproject.org/.

References for the post include:
Tim Berners-Lee, W3C Chair, Web Design Issues, September 1997
John Sheridan, UK National Archives, data.gov.uk, February 2010
Jill Mesirov, Chief Informatics Officer of the MIT/Harvard Broad Institute, in Science, January 2010
Luc Moreau, University of Southampton, in The Foundations of Provenance on the Web, November, 2009
Vinton Cerf, Internet pioneer, in Smithsonian's "40 Things you need to know about the next 40 years" issue, July, 2010
Jeff Jarvis, media company consultant and associate professor at the City University of New York's Graduate School of Journalism, in The importance of provenance on his BuzzMachine blog, June, 2010

Tuesday, June 5, 2012

Running less.js on the JVM Server

less.js is a CSS templating language with a JavaScript file that converts templates into a CSS file.

The less.js distribution includes a Rhino patch to run less.js from the command line using Rhino. less.js no longer produces a Rhino version, but the patch remains available in the master branch.

less.js 1.3.0 uses ECMAScript 5 and will attempt to upgrade the Object and Array prototypes if run in a non-ECMAScript-5 environment. This prevents the script from running in many of the ECMAScript environments that ship with JDK 6.

There are two popular jars available that provide a Java API for less.js. Both of them use Rhino to run the script in the JVM.

Asual's has been around longer and hacks the Rhino patch to run as a library. Asual requires the latest version of Rhino.

lesscss-java claims to be the official Java version and includes envjs (which mimics a browser's script environment for running HTML apps offline). This allows the library to run less.js just as it would run in the browser. Envjs requires the latest version of Rhino.

If you try to run less.js using the ECMAScript engine in JDK 6, you may find that the core object prototypes are sealed and cannot be extended.

The version of ECMAScript on Mac JVMs seems to be only ECMAScript 3 (JavaScript 1.5). To run less.js you have to patch it to use utility functions instead of ECMAScript 5 functions. less.js also requires window and document objects to function. However, you can get away with the following environment.

        var window = {};
        var location = {port:0};
        var document = {
            getElementsByTagName: function(){return []},
            getElementById: function(){return null}
        };
        var require = function(arg) {
            return window.less[arg.split('/')[1]];
        };

less.js uses XMLHttpRequest to import referenced documents. If you want to load other files yourself, it is best to override the window.less.Parser.importer function.

The function takes (path, paths, callback, env), where path is the import URL, paths is an array (passed in from the constructor options), callback is a function to send the results, and env is the constructor options. The callback takes (e, root, content), where e is a thrown error, root is the parse tree, and content is the file's contents (for error reporting). Here is a skeleton of the code you would need to run on JDK 6.
        var contents = {};
        window.less.Parser.importer = function(path, paths, callback, env) {
            if (path != null) {
                var uri = new java.net.URI(paths[0]).resolve(path).normalize();
                var content = ...  // TODO read the uri content as a string
                var dir = uri.resolve(".").normalize();
                var file = dir.relativize(uri).toASCIIString();
                contents[file] = content;
                var parser = new window.less.Parser({
                    optimization: 3,
                    filename: file,
                    opaque: true,
                    paths: [dir.toASCIIString()]
                });
                parser.imports.contents = contents;
                parser.parse(content, function (e, root) {
                    if (e) throw e;
                    callback(e, root, content);
                });
            }
        };

To help debug less.js errors, the above includes a fix for issue 592. Every new window.less.Parser has an imports.contents map, and this map needs to contain the basename of any imported file to resolve error locations. If the map does not contain the basename, a charAt error is thrown.

If running server side, you may also be interested in this patch to inline both LESS and CSS files. The opaque flag above turns this on.
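To tie the pieces together on the Java side, here is a minimal sketch using the javax.script engine that ships with JDK 6 (Rhino). The file names less-env.js (the window/document/require shim above) and less-patched.js (a copy of less.js patched for ECMAScript 3) are illustrative, and error handling is omitted:
import java.io.FileReader;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class LessCompiler {
    public static String compile(String lessSource) throws Exception {
        ScriptEngine js = new ScriptEngineManager().getEngineByName("JavaScript");
        js.eval(new FileReader("less-env.js"));     // window/document/require shim
        js.eval(new FileReader("less-patched.js")); // less.js patched for ECMAScript 3
        js.put("input", lessSource);
        // Parse synchronously and capture the rendered CSS
        return String.valueOf(js.eval(
            "var css; new window.less.Parser().parse(input, function(e, root) {"
            + " if (e) throw e; css = root.toCSS(); }); css"));
    }
}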

Sunday, January 15, 2012

Blob Store

In release 2.0-beta14 (I know, this is a late beta release) AliBaba introduced a new BLOB store. The BLOB store integrates with the RDF repository's ObjectRepository to synchronize transactions. This keeps both the BLOB store and the RDF store isolated and always consistent with one another. This is done using two-phase commit transactions in the BLOB store.

The BLOB store also has a few other advantages over a traditional file system. First, every change is isolated until it is closed/committed. This prevents other readers from seeing an incomplete BLOB and helps prevent inconsistency between the BLOB and RDF stores. In addition, as disk space is generally considered cheap, all past versions of BLOBs are kept on disk by default. This allows any previous version to be retrieved (and restored) using the API.

The BLOB store API is fairly simple. Here is what some code might look like using the BLOB store.

BlobStoreFactory factory = BlobStoreFactory.newInstance();
BlobStore store = factory.openBlobStore(new File("."));
String key = "http://example.com/store1/key1";
BlobObject blob = store.open(key);

// Write the BLOB (the change is isolated until the stream is closed)
OutputStream out = blob.openOutputStream();
try {
    // write stream to out
} finally {
    out.close();
}

// Read the BLOB back
InputStream in = blob.openInputStream();
try {
    // read stream from in
} finally {
    in.close();
}

More API options can be seen in the JavaDocs.

Thursday, June 2, 2011

Web Developer Review of BlackBerry PlayBook

Most reviews for the PlayBook focus on the same issue: very few downloadable apps in App World. As a web developer, I couldn't care less.

First Impression

Websites render fast and, due to the high DPI, look really nice. With its compact form, it fits well in my hands, is easy to type on, and is very portable. With a Flash plugin included, streaming video is smooth and full screen works. Videos look really slick when plugged into an HD TV. Each app can only open one window, so the browser supports tabs and allows you to keep multiple tabs open at once.

Honeymoon Ends

Tabbed browsing works on the desktop, but not on the PlayBook. Only the open tab can be actively loading. Opening a new tab before the page loads can cancel the page from loading. Opening a new tab while watching video pauses the video. This makes watching commercials really frustrating because you can't turn away or it will pause. Watching videos in the browser is also frustrating, as after five minutes the PlayBook goes into suspend. (There are some tricks to stop this, but not if in full screen mode.)

In addition, despite all the fuss about multitasking, the PlayBook can't multitask. Specifically, you can only have one web page active at a time, and this includes webapps.

Surprisingly, the PlayBook is much less web-developer friendly than I expected. The script engine is incomplete. There is no offline support for webapps. There is no support for turning a web application into a chromeless app. WebWorks development requires a series of confusing bat commands that don't work the first time. All of this makes it really hard to develop for the PlayBook.

What's Left

The apps I use include Browser, Wi-Fi Sharing, Word To Go, Slides To Go, Videos, Pictures, aVNC, and ReelPortal. All of them work, but I expected more from almost every one of them.

All that being said, I am going to hold on and put up with the current limitation of the PlayBook. I really like having a portable web browser, and I believe there is still a lot of potential for this device. I am looking forward to seeing what the next software update has to offer.

Monday, February 14, 2011

Five Steps to a More Secure Web App

There are a number of different authentication methods available to choose from when launching (or updating) a Web application. Choosing the wrong method can leave the system (or worse, the users) vulnerable to cyber attacks or identity theft.

Below are five rules that should always be obeyed (regardless of the method). By considering these rules and how your users will use your system, you can better understand the security requirements of your Web application and can choose the right method.

1) Never send clear user passwords over an unencrypted channel.

When passwords are sent over an unencrypted channel, anyone who has access to the network (and a little know-how) can read them. This should never be done with user-supplied passwords (not even for intranet websites). Users often use the same password for multiple systems. Exposing a user's password in one system puts them at risk in another.

Both basic authentication and form-based authentication are vulnerable to this and should never be used when users can choose their own passwords. Digest authentication and encrypted logins do not send clear passwords, and can be used when users can choose their own passwords.

HTTP basic authentication and HTML form-based logins can be used in secure networks to restrict Web access as long as the passwords are pseudo random, unpredictable, and unique across other systems.

For systems that allow user created passwords, care must be taken to ensure the passwords are not readable by others by using HTTPS or digest during logins.

2) Never send session tokens unencrypted over a shared network.

Unencrypted session tokens are visible to anyone who has access to the network. Although session tokens don't expose the user's password, they do allow hijacking accounts with unlimited access. This should never be used over a public wifi network (or other shared network) to access private information or make changes.

Cookie based authentication over HTTP is vulnerable to this. Digest authentication and HTTPS sessions are not vulnerable.

Digest authentication uses a unique "salt" (nonce) for every request, and digest systems can prevent the same "salt" from being used more than once (although this is optional). By never using and never allowing the same authentication token twice, digest authentication prevents account hijacking.

HTTPS requests are encrypted and prevent eavesdropping from others on the network, preventing access to any request tokens that might be present.

Only HTTPS (using keys from a certificate authority or self-signed keys, even with mixed content) or digest authentication should be used to exchange private information over shared networks.

For more information about the vulnerabilities of using session tokens see Weaning the Web Off of Session Cookies.

3) Always verify information sent over an insecure network.

Insecure networks may be vulnerable to malicious attacks such as DNS poisoning or a trojan Web proxy. These attacks are often called man-in-the-middle attacks and can manipulate the content from the server before it reaches the client (and vice versa).

Most unencrypted HTTP communication is vulnerable to this. Even mixed content of both HTTPS and HTTP is vulnerable to man-in-the-middle because compromised HTTP content can read and manipulate HTTPS content.

Although digest authentication includes an optional integrity check to prevent this, most browsers either don't check or don't indicate to the user if the content has been verified.

All Web browsers verify HTTPS content (when not mixed), and this should be used on insecure networks. For mobile devices, which often connect from potentially insecure networks, HTTPS (self-signed or CA-signed) should be enabled by default for any private information.

4) Never give confidential information without verifying authenticity of the server.

Well disguised URLs and familiar looking pages can trick users into visiting and pseudo-logging into illegitimate websites. If your website asks your users for confidential information, ensure there is a clear way for your users to verify the authenticity of the site before logging in. Otherwise, your users might give confidential information to untrustworthy third parties without even knowing it.

HTTPS using previously distributed keys (such as keys from an established certificate authority) allows the user to verify the organization in their browser (near the address bar). This allows the user to quickly verify the authenticity of the server.

HTTPS with self-signed certificates cannot be used to verify authenticity unless the certificates have been previously distributed through a secure channel.

Although digest authentication can include authentication-info to verify authenticity, most browsers either ignore it or don't indicate to the user when the site is verified. However, most browsers do show the host name and realm to the user for review before logging in, and this does give the user a chance to check the domain name first.

Always use HTTPS for confidential or sensitive information.

5) Never access sensitive information over an unencrypted channel.

HTTP traffic can be viewed by anyone who has access to the network. It is vital that sensitive information is never sent over unencrypted HTTP. Sensitive information should always use HTTPS.

Only HTTPS with known certificates should be used to exchange sensitive information with users.

Always use HTTPS for confidential or sensitive information.

In summary

By obeying these five rules you can pick the right authentication method and prevent your system and users from being vulnerable to cyber attacks and identity theft.

Sunday, November 28, 2010

Status Code 200 vs 303

The public LOD mailing list has been dominated by discussions on using a 303 response to a GET request to distinguish between the requested resource's identifier and a description document's identifier.

Some resources can be represented completely on the Web. For these resources, any of their URLs can be used to identify them. This blog page, for example, can be identified by the URL in a browser's address bar. However, some resources cannot be completely viewed on the Web - they can only be described on the Web.

The W3C recommends responding with a 200 status code for GET requests of a URL that identifies a resource which can be completely represented on the Web (an information resource). They also recommend responding with a 303 for GET requests of a URL that identifies a resource that cannot be completely represented on the Web.

Popular Web servers today don't have much support for resources that can't be represented on the Web. This creates a problem for deploying (non-document) resource servers, as it can be very difficult to set up resources for 303 responses. The public LOD mailing list has been discussing an alternative of using the more common 200 response for any resource.

The problem with always responding to a GET request with a 200 is the risk of using the same URL to identify both a resource and a document describing it. This breaks a fundamental Web constraint that says URIs identify a single resource, and causes URI collisions.

It is impossible to be completely free of all ambiguity when it comes to URI allocation. However, any ambiguity can impose a cost in communication due to the effort required to resolve it. Therefore, within reason, we should strive to avoid it. This is particularly true for Web recommendation standards.

URI collision is perhaps the most common ambiguity in URI allocation. Consider a URL that refers to the movie The Sting and also identifies a description document about the movie. This collision creates confusion about what the URL identifies. If one wanted to talk about the creator of the resource identified by the URL, it would be unclear whether this meant "the creator of the movie" or "the editor of the description." Such ambiguity can be avoided by using a 303 response for the movie URL to redirect to a 200 response for the description URL.

As Tim Berners-Lee points out in an email, even including a Content-Location in a 200 response (to indicate a description of the requested resource) "leaves the web not working", because such techniques are already used to associate different representations (and different URLs) to the same resource, and not the other way around.

Using a 200-series status code for representations that merely describe a resource (and don't completely represent it) causes ambiguity, because Web browsers today interpret all 200-series responses (from a GET request) as containing a complete representation of the resource identified in the request URL.

Every day, people bookmark and send links to documents they are viewing in a Web browser. It is essential that any document viewed in a Web browser has a URL identifier in the browser's address bar. Web browsers today don't look at the Content-Location header to get the document URL (nor should they). For Linked Data to work with today's Web, it must keep requests for resources separate from requests for description documents.

The community has voiced common concerns about the complexity of URI allocation and the use of 303s using today's software. The LOD community jumped in with a few alternatives, however, we must consider how the Web works today and be realistic on further Web client expectations. The established 303 technique works today using today's Web browsers. 303 redirect may be complicated to setup in a document server, but let's give Linked Data servers a chance to mature.