How machine learning is changing information governance

While machine learning sounds interesting, how does it differ from what is currently on the market? Why is this technology so important to the future success of compliance operations in your company? There is some debate over its definition, but one of machine learning’s pioneers put it like this:

“Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.” – Arthur Samuel

Arthur Samuel demonstrated this in the 1950s by developing a checkers-playing program on IBM’s first commercially available computer (think Watson 1.0). The program played against itself in order to learn strategy and winning moves. Its debut increased IBM’s stock 15 points overnight.

Today, machine learning algorithms have evolved from playing checkers to performing complex tasks such as automatically tagging friends in Facebook photos, ranking websites in Google, and providing Netflix and Amazon suggestions. Stanford even created an autonomous RC helicopter using machine learning.

[Video: Stanford Autonomous Helicopter]

This type of autonomous learning is probably what most people think of when discussing machine learning. However, learning algorithms are most commonly classified as either supervised or unsupervised.

Supervised machine learning, such as auto-categorization, requires some structure to be applied to the data. Feeding the algorithm examples that are representative of a category is the “supervised” part. A good illustration is predicting whether a tumor is malignant or not. To produce a yes or no answer to this complicated diagnosis, a learning algorithm takes records from past patients, each with a known diagnosis, that describe the attributes of each tumor and patient: age, tumor size, clump thickness, uniformity of cell size, uniformity of cell shape and any other data that the hospital has available. Given that information, the algorithm can then accept the attributes of a new patient’s tumor and predict the likelihood that it is malignant.
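To make the tumor example concrete, here is a minimal sketch of one simple supervised method, k-nearest neighbors. The patient records, the three attributes used and the function names are all made up for illustration; a real system would train on actual hospital data and would likely use a more capable model.

```python
import math
from collections import Counter

# Hypothetical, made-up records of past patients:
# (age, tumor size in cm, clump thickness) paired with the known diagnosis.
PAST_CASES = [
    ((35, 1.0, 2), "benign"),
    ((42, 1.5, 3), "benign"),
    ((50, 1.2, 1), "benign"),
    ((61, 4.0, 8), "malignant"),
    ((58, 3.5, 7), "malignant"),
    ((67, 5.1, 9), "malignant"),
]

def feature_ranges(cases):
    """Return (min, max) per attribute so every attribute can be scaled to 0..1."""
    columns = zip(*(features for features, _ in cases))
    return [(min(col), max(col)) for col in columns]

def normalize(features, ranges):
    """Scale each attribute to 0..1 so that no single attribute dominates the distance."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(features, ranges)]

def predict_malignancy(new_patient, cases=PAST_CASES, k=3):
    """Return the share of the k most similar past cases that were malignant."""
    ranges = feature_ranges(cases)
    target = normalize(new_patient, ranges)
    nearest = sorted(
        cases,
        key=lambda case: math.dist(normalize(case[0], ranges), target),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes["malignant"] / k

print(predict_malignancy((63, 4.5, 8)))  # 1.0: the three closest past cases were malignant
print(predict_malignancy((40, 1.1, 2)))  # 0.0: the three closest past cases were benign
```

The labeled past cases are what make this supervised: the algorithm never sees a rule for malignancy, only examples with known answers.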

Conceptual search is also considered supervised machine learning. Engines learn from the content they ingest and can then find similar content based on the conceptual space they have created. Think of categorization in “go find more like this” mode, except that rather than sorting similar items into categories, the engine shows the results in relevance-ranked order.

Unsupervised machine learning does not require a predefined attribute structure. You can give the algorithm all of the data you have, and it sorts that data autonomously. The image below shows the result when a set of genomic information is given to an unsupervised learning algorithm that is asked to organize it. Here, a clustering algorithm automatically separates the genetic information into similar groups and colors them by degree.
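As a rough illustration of clustering, the sketch below implements plain k-means, one common unsupervised algorithm. The sample values are invented stand-ins for two measurements per genetic sample, and the seeding strategy is a simplification chosen to keep the result repeatable; the text does not say which algorithm produced the genomic image.

```python
import math

# Hypothetical measurements for six samples; no labels are supplied.
SAMPLES = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]

def kmeans(points, k, iterations=20):
    """Group points into k clusters using k-means."""
    # Deterministic farthest-first seeding: start at the first point, then
    # repeatedly pick the point farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return clusters

for group in kmeans(SAMPLES, 2):
    print(group)
```

Nothing in the input says which samples belong together; the grouping comes entirely from how close the values are to one another.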

[Image: genetic machine learning]

Machine Learning and Information Governance

Recently we announced our partnership with Content Analyst Company. The integration of Content Analyst’s CAAT machine learning engine into our Altitude IG platform will allow our software platform to automatically group and classify electronically stored information (ESI) with accuracy unseen in the information governance industry.

A key point about machine learning in IG is that it does not depend on a big list of keywords, terms and phrases that someone has to create manually in order to find content containing those terms. In the case of defensible deletion of email, for example, you might want to find all emails older than one year that are likely mass emails and hence highly likely to be ROT (redundant, outdated or trivial). Think daily newsletters, email marketing offers, etc. Using keywords, you might look for the word “unsubscribe”, which may indicate that a message came from a mass email blast. But not all mass emails use the word “unsubscribe”; some use “manage your subscriptions”, others use “opt out”, and so on. So someone has to come up with that list. The same goes for content that must be retained, whether email or documents. With machine learning, it’s a matter of taking examples and finding more like them, regardless of the individual terms (and misspellings, abbreviations, acronyms, etc.) used in the content.
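The difference between the two approaches can be sketched in a few lines. The example below compares a keyword check with a simple “more like this” check based on cosine similarity of word counts. The emails, the function names and the 0.3 threshold are all assumptions made for illustration, and this is not how CAAT works internally; a concept-based engine goes well beyond counting shared words.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def keyword_match(email, keywords=("unsubscribe",)):
    """Keyword approach: flag the email only if a listed term appears in it."""
    return any(k in email.lower() for k in keywords)

def looks_like_examples(email, examples, threshold=0.3):
    """Example approach: flag the email if it resembles any hand-picked example."""
    counts = word_counts(email)
    return any(cosine(counts, word_counts(ex)) >= threshold for ex in examples)

# Hypothetical data: one example of a mass email chosen by an information manager.
EXAMPLES = ["Weekly newsletter special offer save today unsubscribe here"]
MASS_EMAIL = "Newsletter special offer save big today opt out here"
PERSONAL_EMAIL = "Can we move the budget review meeting to Thursday?"

print(keyword_match(MASS_EMAIL))                  # False: it never says "unsubscribe"
print(looks_like_examples(MASS_EMAIL, EXAMPLES))  # True: it resembles the example
print(looks_like_examples(PERSONAL_EMAIL, EXAMPLES))  # False
```

The keyword check misses the second mass email because it says “opt out”, while the example-based check catches it without anyone having to extend a keyword list.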

As such, an information manager can highlight the words they are looking for or, preferably, those words in context: the entire sentence, paragraph or section of the document that is representative of what they are looking for. Take the word “driver”, for example. Suppose I’m on the QA team for a computer hardware manufacturer, and the request is: “We found an issue with our software driver and we want to investigate how it happened.” Highlighting the word “driver” alone would also bring back anything related to the company’s fleet of thousands of delivery truck drivers. But if the word is selected in the context of its use as a software driver, CAAT is able to distinguish that use from its use in the context of trucks and deliveries. That’s machine learning understanding the conceptual difference between a software driver and a truck driver, just as you did when you read this paragraph. The word “driver” next to (or in the same sentence or paragraph as) the word “software” has a different context than “driver” next to the word “truck”.
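The effect of context on the “driver” example can be shown with a deliberately crude stand-in: scoring documents by how many of the query’s words they share. The two documents and the function names below are invented for illustration, and word overlap is only a rough proxy for the conceptual matching an engine such as CAAT performs.

```python
import re

def words(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, document):
    """Fraction of the query's distinct words that also appear in the document."""
    query_words = words(query)
    if not query_words:
        return 0.0
    return len(query_words & words(document)) / len(query_words)

def rank(query, documents):
    """Return (name, score) pairs, best match first."""
    scored = [(name, overlap_score(query, text)) for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents from the hardware manufacturer's archive.
DOCUMENTS = {
    "fleet_report": "The delivery truck driver finished the route ahead of schedule",
    "bug_report": "The new software driver crashes after the firmware update on Windows",
}

# The single word matches both documents equally well.
print(rank("driver", DOCUMENTS))

# The highlighted sentence ranks the software document first.
print(rank("We found an issue with our software driver", DOCUMENTS))
```

With only “driver” highlighted, the two documents are indistinguishable. Once the surrounding sentence is included, the word “software” tips the ranking toward the bug report.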

Content Analyst’s CAAT engine puts the power of machine learning in the hands of information professionals. Instead of requiring them to create complex rules that filter and sort files based on keywords, the CAAT engine takes several example documents that represent the content by which they would like to sort. The engine uses these examples to filter and sort semantically related documents into the appropriate categories. As more documents are submitted to the corpus, the engine evaluates the semantic relevance of each one and dynamically filters it into the correct category. Every time a new document is submitted, the engine is able to learn new ways to filter documents into their correct categories. This eliminates the time an information professional would otherwise spend maintaining complex filtering rules. The algorithm now does that for you.

To learn more about machine learning with CAAT and Sherpa Software, click here or contact a Sherpa representative today!

