Scalable and declarative text mining on Big Data processing platforms

Stratosphere is a data analytics platform designed to cover a wide variety of use cases, including the analysis of structured data, semi-structured data, and unstructured, textual data. Analyses are specified as data flows in a declarative scripting language called Meteor (other APIs also exist). Meteor scripts are composed of primitive operators, which are defined in domain-specific packages, i.e., self-contained libraries containing the operator implementations, their syntax, and semantic annotations. A Meteor script is parsed into an algebraic representation, logically optimized, and compiled into a parallel data flow program of parallelization primitives (e.g., map, reduce, cross) that wrap the operator implementations. Subsequently, the parallel data flow program is physically optimized, translated into an execution graph, and deployed on the given hardware. Currently, the system ships with more than 60 different operators organized in four packages, i.e., general purpose (BASE), information extraction (IE), web analytics (WA), and data cleansing (DC).
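To make the compilation step more concrete, the following is a minimal Python sketch of how logical operators might be lowered to parallelization primitives. It is not Stratosphere or Meteor code: the classes `Map` and `Reduce`, the functions `compile_script` and `run`, and the operator names `tokenize` and `count` are all hypothetical, and the primitives are executed sequentially here rather than in parallel.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, List

# Hypothetical stand-ins for parallelization primitives (not Stratosphere's API).
@dataclass
class Map:
    fn: Callable          # record -> list of output records

@dataclass
class Reduce:
    key: Callable         # record -> grouping key
    fn: Callable          # (key, list of records) -> one output record

def compile_operator(name: str) -> List[object]:
    """Lower one (hypothetical) logical operator to a list of primitives."""
    if name == "tokenize":
        return [Map(lambda doc: [(word.lower(), 1) for word in doc.split()])]
    if name == "count":
        return [Reduce(key=lambda kv: kv[0],
                       fn=lambda k, group: (k, sum(v for _, v in group)))]
    raise ValueError(f"unknown operator: {name}")

def compile_script(operators: List[str]) -> List[object]:
    """Compile a sequence of logical operators into a flat primitive plan."""
    return [prim for op in operators for prim in compile_operator(op)]

def run(plan, records):
    """Execute the plan sequentially; a real engine would distribute each step."""
    data = list(records)
    for step in plan:
        if isinstance(step, Map):
            data = [out for rec in data for out in step.fn(rec)]
        else:
            data = sorted(data, key=step.key)   # groupby needs sorted input
            data = [step.fn(k, list(g)) for k, g in groupby(data, key=step.key)]
    return data

plan = compile_script(["tokenize", "count"])
result = run(plan, ["big data", "Big text data"])
print(result)
```

The sketch leaves out logical and physical optimization entirely; it only illustrates the idea that a script of operators becomes a plan of primitives that wrap the operator implementations.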
In this project, we designed and developed the IE and WA packages for Stratosphere, which, with more than 40 operators, are the largest packages available. The IE package includes operators for annotating texts with syntactic annotations (e.g., sentence boundaries, part-of-speech tags) and semantic annotations (e.g., mentions of different entity types, relationships between entities), and for merging annotations using different schemes. The WA package contains operators specific to the analysis of web documents, such as link extraction, markup removal, or markup repair.
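As an illustration of the kinds of operators described above, the following Python sketch chains markup removal, link extraction, sentence boundary annotation, dictionary-based entity annotation, and a simple annotation merge. It uses only the Python standard library and does not reproduce the actual IE or WA operators: all function names, the regular-expression sentence splitter, the toy dictionary, and the merge scheme (sort by span start) are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

class _TextAndLinks(HTMLParser):
    """Collects text content and href targets while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)

def remove_markup(html):
    """WA-style step (hypothetical): return (plain text, extracted links)."""
    parser = _TextAndLinks()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())  # normalize whitespace
    return text, parser.links

def annotate_sentences(text):
    """Syntactic annotation (naive): spans ending in '.', '!' or '?'."""
    return [(m.start(), m.end(), "sentence")
            for m in re.finditer(r"\S[^.!?]*[.!?]", text)]

def annotate_entities(text, dictionary):
    """Semantic annotation (naive): exact dictionary lookup of entity names."""
    return [(m.start(), m.end(), label)
            for term, label in dictionary.items()
            for m in re.finditer(re.escape(term), text)]

def merge_annotations(*layers):
    """One possible merge scheme: order by start, longer spans first."""
    return sorted((a for layer in layers for a in layer),
                  key=lambda a: (a[0], -a[1]))

html = ('<p>Stratosphere runs Meteor scripts. '
        'See <a href="http://example.org/docs">the docs</a>.</p>')
text, links = remove_markup(html)
entities = annotate_entities(text, {"Stratosphere": "SYSTEM", "Meteor": "LANGUAGE"})
merged = merge_annotations(annotate_sentences(text), entities)
for start, end, label in merged:
    print(label, repr(text[start:end]))
```

Each function takes and returns plain values, so the steps can be reordered or replaced independently, which loosely mirrors composing a script from primitive operators.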

Download and instructions for use are available upon request.

Publications

  • Heise, A., Rheinländer, A., Leich, M., Naumann, F., and Leser, U.: Meteor/Sopremo: An Extensible Query Language and Operator Model. Int. Workshop on End-to-end Management of Big Data (BigData 2012), held in conjunction with 38th Int. Conf. on Very Large Data Bases (VLDB), Istanbul, Turkey, 2012.
  • Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., Naumann, F., Peters, M., Rheinländer, A., Sax, M., Schelter, S., Höger, M., Tzoumas, K., Warneke, D.: The Stratosphere Platform for Big Data Analytics. VLDB Journal, 2014.
  • Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinländer, A., Markl, V.: Applying Stratosphere for Big Data Analytics. Demo at BTW 2013, Magdeburg, Germany.
  • 2nd International Workshop on High Performance Bioinformatics and Biomedicine (HiBB), Bordeaux, France, 2011.