Piranha (software)
Piranha is a text mining system developed for the United States Department of Energy by Oak Ridge National Laboratory. The software processes free-text documents and shows the relationships among them, a technique useful in many data domains, from health care fraud to national security. Results are presented as clusters prioritized by relevance. Piranha uses the term frequency/inverse corpus frequency (TF-ICF) term weighting method, which lends itself to parallel processing of text and therefore allows large document sets to be analyzed.
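The term frequency/inverse corpus frequency idea can be illustrated with a minimal Python sketch. It assumes the commonly cited form of the weight, log(1 + tf) × log((N + 1) / (n + 1)), where N and n are taken from a fixed reference corpus rather than from the documents being analyzed; this is what lets each document be weighted independently, and hence in parallel. The function name, the reference statistics, and the sample text are hypothetical, and this is not Piranha's actual code.

```python
import math
from collections import Counter

def tf_icf_vector(tokens, corpus_doc_freq, corpus_size):
    """Weight the terms of ONE document against a fixed reference corpus.

    Assumed formula: log(1 + tf) * log((N + 1) / (n + 1)), where
    N = number of documents in the reference corpus and
    n = number of reference documents containing the term.
    No other document in the working set is consulted.
    """
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        n = corpus_doc_freq.get(term, 0)
        weights[term] = math.log(1 + tf) * math.log((corpus_size + 1) / (n + 1))
    return weights

# Hypothetical reference-corpus statistics, invented for illustration.
REFERENCE_DF = {"the": 990, "fraud": 12, "claim": 40}
REFERENCE_SIZE = 1000

vec = tf_icf_vector("the claim the fraud fraud".split(), REFERENCE_DF, REFERENCE_SIZE)
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Because the function needs only the document's own tokens and the precomputed reference statistics, calls for different documents can be distributed across workers without coordination.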
Piranha has six main elements:
- Collecting and extracting: Millions of documents can be collected from sources such as databases and social media, and text can be extracted from hundreds of file formats. This information can be translated into other languages.
- Storing and indexing: Documents can be stored and indexed in search servers, relational databases, etc.
- Recommending: The system can highlight the most valuable information for specific users.
- Categorizing: Items are grouped using supervised and semi-supervised machine learning methods and targeted search lists.
- Clustering: Similarity is used to group documents hierarchically.
- Visualizing: Relationships among documents are shown so that users can quickly recognize connections.
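The clustering element, grouping documents hierarchically by similarity, can be sketched as follows. This is a generic illustration, not Piranha's own algorithm: it assumes cosine similarity over sparse term-weight vectors and average-linkage agglomerative merging, and the function names and sample vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(vectors):
    """Repeatedly merge the two most similar clusters (average linkage).

    Returns the merge history as (members_a, members_b, similarity)
    tuples; read in order, the history describes the hierarchy.
    """
    clusters = [[i] for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        history.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical term-weight vectors for three documents.
docs = [
    {"fraud": 2.0, "claim": 1.0},
    {"fraud": 1.5, "claim": 1.2},
    {"reactor": 2.0, "energy": 1.0},
]
history = agglomerate(docs)
for left, right, score in history:
    print(left, right, round(score, 3))
```

The two fraud-related documents merge first, and the unrelated third document joins only at the top of the hierarchy. The pairwise search shown here is quadratic per merge, so it suits small examples rather than large collections.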
Awards
- 2007 R&D 100 Magazine's Award
Patents
- System for gathering and summarizing internet information
- Method for gathering and summarizing internet information
- Agent-based method for distributed clustering of textual information
- Dynamic reduction of dimensions of a document vector in a document search and retrieval system
- Method and system for determining precursors of health abnormalities from processing medical records