Piranha (software)


Piranha is a text mining system developed for the United States Department of Energy by Oak Ridge National Laboratory. The software processes free-text documents and shows the relationships among them, a technique that is useful in many data domains, from health care fraud to national security. The results are presented in clusters ranked by relevance. Piranha uses the term frequency/inverse corpus frequency (TF-ICF) term weighting method, which lends itself to parallel processing of text and therefore allows large document sets to be analyzed.
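The reason this kind of weighting parallelizes well can be illustrated with a minimal sketch. It assumes one commonly cited form of TF-ICF, log(1 + tf) × log((N + 1) / (n + 1)), where N and n come from a fixed reference corpus rather than from the documents being analyzed. The function names and the whitespace tokenizer are hypothetical and are not taken from Piranha itself.

```python
import math
from collections import Counter


def build_corpus_counts(reference_docs):
    """Count, for each term, how many reference documents contain it."""
    counts = Counter()
    for doc in reference_docs:
        counts.update(set(doc.lower().split()))
    return counts, len(reference_docs)


def tf_icf_vector(doc, corpus_counts, corpus_size):
    """Weight one document's terms using only the fixed reference statistics.

    Assumed form: log(1 + tf) * log((N + 1) / (n + 1)).
    """
    term_freq = Counter(doc.lower().split())
    return {
        term: math.log(1 + freq)
        * math.log((corpus_size + 1) / (corpus_counts.get(term, 0) + 1))
        for term, freq in term_freq.items()
    }


# Hypothetical reference corpus; "the" appears in every document.
counts, size = build_corpus_counts(["the cat sat", "the dog ran", "the bird flew"])
vector = tf_icf_vector("the fraud fraud", counts, size)
```

Because each vector depends only on its own document and the precomputed reference counts, documents can be weighted independently of one another, for example on separate processors.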
Piranha has six main elements:
  • Collecting and extracting: Millions of documents can be collected from sources such as databases and social media, and text can be extracted from hundreds of file formats. This information can be translated into other languages.
  • Storing and indexing: Documents can be stored and indexed in search servers, relational databases, and similar systems.
  • Recommending: The system can highlight the most valuable information for specific users.
  • Categorizing: Items are grouped using supervised and semi-supervised machine learning methods and targeted search lists.
  • Clustering: Documents are grouped hierarchically by similarity.
  • Visualizing: Relationships among documents are shown so that users can quickly recognize connections.
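The clustering element can be sketched as a simple agglomerative procedure over weighted term vectors. The source does not say which similarity measure or linkage rule Piranha uses, so cosine similarity and single-link merging are assumptions chosen for brevity, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(vectors):
    """Repeatedly merge the two most similar clusters (single link).

    Returns the merges in order as (left, right, similarity) tuples,
    where left and right are tuples of document indices.
    """
    def link(pair):
        left, right = pair
        return max(cosine(vectors[i], vectors[j]) for i in left for j in right)

    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        left, right = max(combinations(clusters, 2), key=link)
        merges.append((left, right, link((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


# Hypothetical weighted vectors: two fraud-related documents and one unrelated.
docs = [
    {"fraud": 1.0, "claim": 1.0},
    {"fraud": 1.0, "claim": 0.8},
    {"reactor": 1.0},
]
merges = agglomerate(docs)
```

The ordered list of merges describes a hierarchy: the earliest merges join the most similar documents, and later merges join progressively less related groups. This brute-force version recomputes similarities on every pass and is only suitable for small examples.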
Work on Piranha has resulted in eight patents, commercial licenses, two R&D 100 Awards, scores of peer-reviewed research publications, and a spin-off company, VortexT Analytics, formed by the inventors together with Covenant Health and Pro2Serve.

Awards

  • 2007 R&D Magazine R&D 100 Award

Patents

  • System for gathering and summarizing internet information
  • Method for gathering and summarizing internet information
  • Agent-based method for distributed clustering of textual information
  • Dynamic reduction of dimensions of a document vector in a document search and retrieval system
  • Method and system for determining precursors of health abnormalities from processing medical records