Piranha (software)

Piranha is a text mining system developed for the United States Department of Energy by Oak Ridge National Laboratory. The software processes free-text documents and shows the relationships among them, a technique applicable across many data domains, from health care fraud to national security. Results are presented as clusters prioritized by relevance. Piranha uses the term frequency/inverse corpus frequency (TF-ICF) term-weighting method, which lends itself to parallel processing of text and therefore to the analysis of large document sets.
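As an illustration of why a corpus-based weighting scheme parallelizes well, the following minimal Python sketch assumes one formulation of TF-ICF, w = log(1 + f) × log((N + 1) / (n + 1)), where f is the term's count in the document, N is the size of a fixed reference corpus, and n is the number of reference documents containing the term. Under that assumption, the weights depend only on the document itself and on precomputed reference statistics, so each document can be weighted independently of the others. The function name, the reference counts, and the sample text are hypothetical and are not taken from Piranha's code.

```python
import math
from collections import Counter


def tf_icf_vector(document_tokens, corpus_doc_freq, corpus_size):
    """Weight one document's terms against a fixed reference corpus.

    Assumed formulation: log(1 + f) * log((N + 1) / (n + 1)).
    No other document from the collection being analyzed is needed.
    """
    counts = Counter(document_tokens)
    weights = {}
    for term, freq in counts.items():
        # Terms absent from the reference corpus get n = 0 (maximum weight).
        n_term = corpus_doc_freq.get(term, 0)
        weights[term] = math.log(1 + freq) * math.log(
            (corpus_size + 1) / (n_term + 1)
        )
    return weights


# Hypothetical reference-corpus statistics: documents containing each term.
REFERENCE_DOC_FREQ = {"the": 950, "fraud": 12, "claim": 40}
REFERENCE_SIZE = 1000

vec = tf_icf_vector("the claim the fraud".split(), REFERENCE_DOC_FREQ, REFERENCE_SIZE)
```

In this sketch, terms that are rare in the reference corpus ("fraud") receive higher weights than common ones ("the"), even when the common term occurs more often in the document.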
Piranha has six main elements:

Collecting and extracting: Millions of documents from sources such as databases and social media can be collected, and text can be extracted from hundreds of file formats. This information can be translated into other languages.
Storing and indexing: Documents can be stored and indexed in search servers, relational databases, and similar systems.
Recommending: The system can highlight the most valuable information for specific users.
Categorizing: Items are grouped using supervised and semi-supervised machine learning methods and targeted search lists.
Clustering: Similarity is used to group documents hierarchically.
Visualizing: Relationships among documents are shown so that users can quickly recognize connections.
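
The clustering element can be sketched in a few lines of Python. This example assumes cosine similarity between term-weight vectors and average-linkage agglomerative merging; the source does not say which similarity measure or linkage Piranha uses, and all names and sample vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(vectors):
    """Merge the two most similar clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, similarity)
    tuples, which together describe a hierarchy of documents.
    """
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise similarity between members.
                sims = [
                    cosine(vectors[p], vectors[q])
                    for p in clusters[i]
                    for q in clusters[j]
                ]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical term-weight vectors for three short documents.
docs = [
    {"fraud": 3.0, "claim": 2.0},
    {"fraud": 2.5, "claim": 2.2},
    {"reactor": 4.0, "energy": 1.0},
]
merges = agglomerate(docs)
```

Here the two documents that share vocabulary are merged first, and the unrelated third document joins only at the top of the hierarchy.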

This work has resulted in eight patents, commercial licenses, a spin-off company called VortexT Analytics (formed by the inventors with Covenant Health and Pro2Serve), two R&D 100 Awards, and scores of peer-reviewed research publications.

Awards

2007 R&D Magazine R&D 100 Award

Patents

– System for gathering and summarizing internet information
– Method for gathering and summarizing internet information
– Agent-based method for distributed clustering of textual information
– Dynamic reduction of dimensions of a document vector in a document search and retrieval system
– Method and system for determining precursors of health abnormalities from processing medical records