At LegalTech 2015 in New York City last month, I took part in a panel discussion regarding the future of machine learning in information governance. It was encouraging to see the high degree of interest that many attendees had in the topic. For years, the legal community has been grappling with the best way to comply with litigation mandates that, in many cases, require searching and reviewing terabytes of information. Technology improvements in areas such as natural language processing and latent semantic indexing are enabling remarkable productivity gains. In the legal world, machine learning is being used to power predictive coding, the organization and prioritization of entire sets of electronically stored information (ESI) based upon their relation to discovery responsiveness, privilege, and other designated legal criteria. I believe we are about to see this same technology deeply impact information governance in an area I’m referring to as “predictive categorization.”

One of the major challenges organizations face as they prepare to implement a comprehensive information governance policy lies in understanding the nature of the information being managed. Answering questions such as “Is this a business record?”, “Do we have a regulatory mandate to preserve it?”, and “What record category does it fit in?” depends on understanding the content and context of the information being analyzed. Historically, these questions have been answered in one of two ways: either through the metadata associated with the information (e.g., how old it is or who authored it) or via manual categorization by an end user. Both methods can offer a passable level of control for smaller data sets, but as the volume and variety of ESI continue to grow, categorization by user or metadata begins to break down.

We’ve seen first-hand how CAAT, the machine learning engine from Content Analyst, can effectively analyze large volumes of content and establish category relationships between documents through a process called dynamic clustering, and the results are amazing! We’re currently working on plans to integrate CAAT and deliver predictive categorization in our Altitude IG information governance platform.
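To make the idea of categorizing by content rather than by metadata more concrete, here is a minimal Python sketch. It is purely illustrative and is not how CAAT or Altitude IG works: it assumes a simple word-count comparison (cosine similarity) against a handful of human-labeled example documents, and the category names, sample texts, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(labeled_examples):
    """Merge the term counts of all example documents in each category."""
    profiles = {}
    for category, text in labeled_examples:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def categorize(text, profiles):
    """Return the best-matching category and its similarity score."""
    vector = Counter(tokenize(text))
    scores = {cat: cosine(vector, prof) for cat, prof in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical examples labeled by a subject matter expert.
EXAMPLES = [
    ("contract", "This agreement is made between the vendor and the customer"),
    ("contract", "The parties agree to the terms and conditions of this agreement"),
    ("invoice", "Invoice number and amount due, payment is due within thirty days"),
    ("invoice", "Please remit payment for the invoice total amount"),
]

PROFILES = build_profiles(EXAMPLES)

if __name__ == "__main__":
    category, score = categorize(
        "Attached is the invoice, payment due on receipt", PROFILES
    )
    print(category, round(score, 2))
```

The point of the sketch is that a few expert-labeled examples can be used to suggest a category for new documents based on what they say, regardless of who authored them or when. A production engine would rely on far richer language models than raw word counts.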

“In order to help corporate clients effectively handle the rapidly-growing volumes and variety of unstructured content, information governance solution providers must adopt new intelligent technologies that automate the processes,” said Kurt Michel, Content Analyst president and CEO. “Sherpa Software’s adoption of the Content Analyst machine learning technology will allow their users to effectively transfer and apply human subject matter expertise to their large collections of unstructured content that must be managed. The addition of our proven, defensible advanced analytics technology will provide Sherpa Software with a true competitive advantage when it comes to classification. We applaud Sherpa for creating a highly-differentiated solution for the information governance market with their Altitude IG product, and we are proud to have the CAAT engine powering their classification capabilities.”

Concurrent with that effort, we are also planning a multi-part white paper series focused on how machine learning can improve your information governance program. Stay tuned for more details or feel free to reach out to Sherpa directly at