One of the most interesting talks at the U.K. e-Science 2011 Meeting was “Cloud Infrastructure for Large-Scale, Open-Source Text Processing,” by Valentin Tablan. The paper was, in fact, voted the best paper at the meeting.

Valentin described “GATE,” a widely used open-source infrastructure for language engineering that now makes use of cloud technology. It takes an automated approach to organizing a pile of documents, which the project describes as follows in its 2-minute guide:

  1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth) — call this your corpus.
  2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) — call this your ontology.
  3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.).
  4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard.
  5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded).
  6. Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For techies: this sits in the backroom as a RESTful web service.)
  7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity graphing, time series graphing, annotation structure search and (last but not least) boolean full text search. (More techy stuff: mash up these types of search with your existing UI’s.)
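
To make steps 1 through 4 concrete, here is a toy sketch in Python of the underlying idea: annotate a text against an ontology, then score the result against a hand-made gold standard. It does not use GATE or any of its APIs. The names `Annotation`, `annotate`, `span` and `score`, the three-term "ontology" and the sample sentence are all invented for illustration, and exact-span matching is just one simple way to compare annotations.

```python
import re
from typing import Dict, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """A labelled character span in a text (hypothetical structure, not GATE's)."""
    start: int
    end: int
    label: str


def annotate(text: str, ontology: Dict[str, str]) -> Set[Annotation]:
    """Tag every whole-word, case-insensitive occurrence of an ontology term."""
    found = set()
    for term, label in ontology.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.add(Annotation(match.start(), match.end(), label))
    return found


def span(text: str, phrase: str, label: str) -> Annotation:
    """Build a gold annotation for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), label)


def score(predicted: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact span-and-label matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Step 1: a (very small) corpus.
text = "Aspirin and ibuprofen reduce pain; caffeine does not."

# Step 2: a stand-in ontology mapping terms to concept labels.
ontology = {"aspirin": "Drug", "ibuprofen": "Drug", "pain": "Symptom"}

# Step 3: a hand-made gold standard that marks up drugs only.
gold = {
    span(text, "Aspirin", "Drug"),
    span(text, "ibuprofen", "Drug"),
    span(text, "caffeine", "Drug"),
}

# Step 4: run the automatic annotator and measure it against the gold standard.
predicted = annotate(text, ontology)
precision, recall, f1 = score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy run the annotator finds the two drugs it knows about, misses "caffeine" because the term is absent from the ontology, and tags "pain", which the gold standard does not mark up, so both precision and recall fall short of 1.0. A real pipeline would iterate on the ontology and rules until the scores are acceptable before moving on to step 5.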

The project has many science customers, including:

myGrid: Extending the GRID framework of distributed computing by producing a virtual laboratory bench that will support the life sciences community and make use of complex distributed resources.

multiFlora: An e-science bioinformatics project for biodiversity support.

CLEF: Building on E-Science technology to embed a full information cycle within practical clinical systems, building tools to integrate patient information from text and images, and linking clinical and genomic research.

I do not see any astronomy projects among the customers. I wonder whether the digitization of old paper journals might benefit from technology such as this?

