Enriching Record Metadata via the Google Vision API

The title of an uploaded file often conveys no real information. We can ask users to supply additional terms, but we can also use machine learning models to tag records automatically. That improves the user's experience and gives search more to work with.

faulkner.jpg
Automatically generated keywords, provided by the Vision API

In the rest of the post I'll show how to build this plugin and integrate it with the Google Vision API.


First things first, I created a solution within Visual Studio that contains one class library.  The library contains one class named Addin, which is derived from the TrimEventProcessorAddIn base class.  This is the minimum needed to be considered an "Event Processor Addin".

using HP.HPTRIM.SDK;
 
namespace CMRamble.EventProcessor.VisionApi
{
    public class Addin : TrimEventProcessorAddIn
    {
        public override void ProcessEvent(Database db, TrimEvent evt)
        {
        }
    }
}

Next I'll add a class library project with a skeleton method named AttachVisionLabelsAsTerms.  This method will be invoked by the Event Processor and will result in keywords being attached for a given record.  To do so it will call upon the Google Vision Api.  The event processor itself doesn't know anything about the Google Vision Api.

using HP.HPTRIM.SDK;
 
namespace CMRamble.VisionApi
{
    public static class RecordController
    {
        public static void AttachVisionLabelsAsTerms(Record rec)
        {
 
        }
    }
}

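Before I can work with the Google Vision API, I have to add its client library to the project via the NuGet package manager. The classes used in the sample below (Image and ImageAnnotatorClient) come from the Google.Cloud.Vision.V1 package.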

The online documentation provides this sample code that invokes the Api:

var image = Image.FromFile(filePath);
var client = ImageAnnotatorClient.Create();
var response = client.DetectLabels(image);
foreach (var annotation in response)
{
    if (annotation.Description != null)
        Console.WriteLine(annotation.Description);
}

I'll drop this into a new static method in my VisionApi class library.  To re-use the sample code I'll need to pass the file path into the method call and then return a list of labels.  I'll mark the method private so that it can't be directly called from the Event Processor Addin.

private static List<string> InvokeDetectLabels(string filePath)
{
    List<string> labels = new List<string>();
    var image = Image.FromFile(filePath);
    var client = ImageAnnotatorClient.Create();
    var response = client.DetectLabels(image);
    foreach (var annotation in response)
    {
        if (annotation.Description != null)
            labels.Add(annotation.Description);
    }
    return labels;
}

Now I can go back to my record controller and build-out the logic.  I'll need to extract the record to disk, invoke the new InvokeDetectLabels method, and work with the results.  Ultimately I should include error handling and logging, but for now this is sufficient.

public static void AttachVisionLabelsAsTerms(Record rec)
{
    // formulate local path names (Path.Combine copes with the trailing separator from GetTempPath)
    string fileName = $"{rec.Uri}.{rec.Extension}";
    string fileDirectory = System.IO.Path.Combine(System.IO.Path.GetTempPath(), "visionApi");
    string filePath = System.IO.Path.Combine(fileDirectory, fileName);
    // create storage location on disk
    if (!System.IO.Directory.Exists(fileDirectory)) System.IO.Directory.CreateDirectory(fileDirectory);
    // extract the file
    if (!System.IO.File.Exists(filePath)) rec.GetDocument(filePath, false, "GoogleVisionApi", filePath);
    // get the labels
    List<string> labels = InvokeDetectLabels(filePath);
    // process the labels
    foreach (var label in labels)
    {
        AttachTerm(rec, label);
    }
    // clean-up my mess (best effort: a failed delete is ignored)
    if (System.IO.File.Exists(filePath)) try { System.IO.File.Delete(filePath); } catch (Exception) { }
}
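The path handling and clean-up in that method can be pulled into a small helper. This is only a sketch: TempWork, BuildPath and UseFile are names I made up for illustration, and it uses nothing beyond System.IO. It builds the working path with Path.Combine and deletes the file in a finally block, so the clean-up also runs when the work in between throws.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of any SDK): temp-file handling for a single working file.
public static class TempWork
{
    // Builds <temp>\<folderName>\<fileName> and makes sure the folder exists.
    public static string BuildPath(string folderName, string fileName)
    {
        string directory = Path.Combine(Path.GetTempPath(), folderName);
        Directory.CreateDirectory(directory); // does nothing when the folder already exists
        return Path.Combine(directory, fileName);
    }

    // Runs the supplied work against the file and always removes the file afterwards.
    public static T UseFile<T>(string filePath, Func<string, T> work)
    {
        try
        {
            return work(filePath);
        }
        finally
        {
            // executes whether or not work threw an exception
            if (File.Exists(filePath)) File.Delete(filePath);
        }
    }
}
```

With something like this in place, the label lookup could be wrapped as TempWork.UseFile(filePath, InvokeDetectLabels), assuming the document has already been extracted to filePath.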

I'll also need to create a new method named "AttachTerm". This method takes a label provided by Google and attaches the matching keyword (thesaurus term) to the record. If the term does not yet exist, the method creates it first.

private static void AttachTerm(Record rec, string label)
{
    // if record does not already contain keyword
    if ( !rec.Keywords.Contains(label) )
    {
        // fetch the keyword
        Keyword keyword = null;
        try { keyword = new HP.HPTRIM.SDK.Keyword(rec.Database, label); } catch ( Exception ex ) { }
        if (keyword == null)
        {
            // when it doesn't exist, create it
            keyword = new Keyword(rec.Database);
            keyword.Name = label;
            keyword.Save();
        }
        // attach it
        rec.AttachKeyword(keyword);
        rec.Save();
    }
}

Almost there! The last step is to go back to the event processor add-in and update it to use the record controller. I'll also need to ensure I only call the Vision API for supported image types, and only when a document is attached or replaced. After making those changes I'm left with the code shown below.

using System;
using HP.HPTRIM.SDK;
using CMRamble.VisionApi;
 
namespace CMRamble.EventProcessor.VisionApi
{
    public class Addin : TrimEventProcessorAddIn
    {
        public const string supportedExtensions = "png,jpg,jpeg,bmp";
        public override void ProcessEvent(Database db, TrimEvent evt)
        {
            switch (evt.EventType)
            {
                case Events.DocAttached:
                case Events.DocReplaced:
                    if ( evt.RelatedObjectType == BaseObjectTypes.Record )
                    {
                        InvokeVisionApi(new Record(db, evt.RelatedObjectUri));
                    }
                    break;
                default:
                    break;
            }
        }
 
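        private void InvokeVisionApi(Record record)
        {
            // match the whole extension; a plain string Contains would also accept fragments such as "p" or ""
            string extension = (record.Extension ?? string.Empty).ToLowerInvariant();
            if ( Array.IndexOf(supportedExtensions.Split(','), extension) >= 0 )
            {
                RecordController.AttachVisionLabelsAsTerms(record);
            }
        }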
    }
}
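The extension check needs to be an exact match and not a substring test, because a substring test against "png,jpg,jpeg,bmp" would also accept "p" or an empty extension. One way to sketch that is a HashSet with a case-insensitive comparer. ImageExtensions and IsSupported are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the SDK): exact, case-insensitive extension matching.
public static class ImageExtensions
{
    private static readonly HashSet<string> Supported =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "png", "jpg", "jpeg", "bmp" };

    public static bool IsSupported(string extension)
    {
        // records without an electronic document have nothing to send to the API
        if (string.IsNullOrWhiteSpace(extension)) return false;
        // tolerate surrounding whitespace and a leading dot, e.g. ".JPG"
        return Supported.Contains(extension.Trim().TrimStart('.'));
    }
}
```

The add-in could then call ImageExtensions.IsSupported(record.Extension) before handing the record to the record controller.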

Next I copied the compiled solution onto the workgroup server and registered the add-in via the Enterprise Studio. 

2018-05-19_7-57-09.png

 

Before I can test it though, I'll need to create a service account within Google Cloud. Once it's created I'll download its key as a JSON file and place it onto the server.

The client library requires that the path to the JSON file be held in an environment variable (GOOGLE_APPLICATION_CREDENTIALS). I set it within the system properties contained in the control panel. The file itself can be placed anywhere on the server that is accessible by the CM service account.

2018-05-19_7-51-45.png
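Since a missing or mistyped variable only shows up when the first API call fails, a quick check can save some head scratching. The sketch below is my own addition (CredentialCheck and Describe are made-up names); it only reads the environment variable and confirms the file exists, it does not validate the key itself.

```csharp
using System;
using System.IO;

// Hypothetical helper: confirms the credentials variable points at a file that exists.
public static class CredentialCheck
{
    public const string VariableName = "GOOGLE_APPLICATION_CREDENTIALS";

    // Returns null when everything looks fine, otherwise a short description of the problem.
    public static string Describe()
    {
        string path = Environment.GetEnvironmentVariable(VariableName);
        if (string.IsNullOrWhiteSpace(path)) return VariableName + " is not set";
        if (!File.Exists(path)) return "Key file not found: " + path;
        return null;
    }
}
```

I'd call this once before creating the client and log whatever it returns. Bear in mind that a process only sees the variable if it was started after the variable was set, so the event processor service may need to be restarted (or the server rebooted) first.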

Woot woot!  I'm ready to test.  I should now be able to drop an image into the system and see some results!  I'll use the same image as provided within the documentation, so that I can ensure similar results.  

2018-05-19_8-16-19.png

Sweet!  Now I don't need to make users pick terms.... let the cloud do it for me!

Automating the generation of Tesseract OCR text renditions

Although IDOL will index the contents of PDF documents, it does not perform its own OCR of the content (at least the OEM connector for CM does not). In the JFK archives this means I can only search on the stamped annotation on each image. Even if IDOL re-OCR'd documents, I couldn't easily extract the words it finds, and I need those words when researching records, performing a retention analysis, culling keywords for a record hold, or writing scope notes for categorization purposes. In the previous post I created a record add-in that generated a plain text file holding the OCR content from the tesseract engine.

Moving forward I want to automate these OCR tasks.  For instance, anytime a new document is attached we should have a new OCR rendition generated.  I think it makes sense to take the solution from the previous post and add to it.  The event processor plugin I create should call the same logic as the client add-in.  If this approach works out, I can then add a ServiceAPI plugin to expose the same functionality into that framework.

So I took the code from the last post and added another C# class library. I added one class that derives from the event processor add-in class. It requires one method to be implemented: ProcessEvent. Within that method I check whether the record is being reindexed, the document has been replaced, the document has been attached, or a rendition has been added or removed. If so, I call the methods from the TextExtractor library used in the previous post.

using HP.HPTRIM.SDK;
using System;
using System.IO;
using System.Reflection;
 
namespace CMRamble.Ocr.EventProcessorAddin
{
    public class Addin : TrimEventProcessorAddIn
    {
        #region Event Processing
        public override void ProcessEvent(Database db, TrimEvent evt)
        {
            Record record = null;
            RecordRendition rendition;
            if (evt.ObjectType == BaseObjectTypes.Record)
            {
                switch (evt.EventType)
                {
                    case Events.ReindexWords:
                    case Events.DocReplaced:
                    case Events.DocAttached:
                    case Events.DocRenditionRemoved:
                        record = db.FindTrimObjectByUri(BaseObjectTypes.Record, evt.ObjectUri) as Record;
                        RecordController.UpdateOcrRendition(record, AssemblyDirectory);
                        break;
                    case Events.DocRenditionAdded:
                        record = db.FindTrimObjectByUri(BaseObjectTypes.Record, evt.ObjectUri) as Record;
                        var eventRendition = record.ChildRenditions.FindChildByUri(evt.RelatedObjectUri) as RecordRendition;
                        if ( eventRendition != null && eventRendition.TypeOfRendition == RenditionType.Original )
                        {   // if added an original
                            rendition = eventRendition;
                            RecordController.UpdateOcrRendition(record, rendition, Path.Combine(AssemblyDirectory, "tessdata\\"));
                        }
                        break;
                    default:
                        break;
                }
            }
        }
        #endregion
        public static string AssemblyDirectory
        {
            get
            {
                string codeBase = Assembly.GetExecutingAssembly().CodeBase;
                UriBuilder uri = new UriBuilder(codeBase);
                string path = Uri.UnescapeDataString(uri.Path);
                return Path.GetDirectoryName(path);
            }
        }
    }
}
 

Note that I created the AssemblyDirectory property so that the tesseract OCR path can be located correctly.  Since this is spawned from TRIMEvent.exe the executing directory is the installation path of Content Manager.  The tesseract language files are in a different location though.  To work around this I pass the AssemblyDirectory property into the TextExtractor.

I updated the UpdateOcrRendition method in the RecordController class so that it accepts the assembly path. If the assembly path is not passed, the value defaults to the original relative path. The record add-in can then be updated to match this approach.

2017-11-14_20-53-36.png
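The defaulting rule described above can be sketched on its own, without the SDK. TessDataLocator and Resolve are hypothetical names; the only things taken from the post are the relative default "./tessdata" and the idea of combining the assembly directory with a tessdata sub-folder.

```csharp
using System.IO;

// Hypothetical helper: picks the tessdata folder for the tesseract engine.
public static class TessDataLocator
{
    public const string DefaultRelativePath = "./tessdata";

    // No assembly directory supplied: fall back to the relative path (works for the client add-in).
    // Assembly directory supplied: use its tessdata sub-folder (needed by the event processor).
    public static string Resolve(string assemblyDirectory = null)
    {
        if (string.IsNullOrWhiteSpace(assemblyDirectory)) return DefaultRelativePath;
        return Path.Combine(assemblyDirectory, "tessdata");
    }
}
```

Keeping the rule in one place means the client add-in and the event processor cannot drift apart on where they expect the language files.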

Within the TextExtractor class I added a parameter to the required method.  I could then pass it directly into the tesseract engine during instantiation.  

2017-11-14_20-56-41.png

If you expand upon this concept you can see how it's possible to use different languages or trainer data. For now I need to go back and add one more method. The event processor reacts when a new rendition is added, but I haven't implemented the logic it calls yet, so I need to create a record controller method that works on renditions.

public static bool OcrRendition(Record record, RecordRendition sourceRendition, string tessData = @"./tessdata")
{
    bool success = false;
    string extractedFilePath = string.Empty;
    string ocrFilePath = string.Empty;
    try
    {
        // get a temp working location on disk
        var rootDirectory = Path.Combine(Path.GetTempPath(), "cmramble_ocr");
        if (!Directory.Exists(rootDirectory)) Directory.CreateDirectory(rootDirectory);
        // formulate file name to extract, delete if exists for some reason
        extractedFilePath = Path.Combine(rootDirectory, $"{sourceRendition.Uri}.{sourceRendition.Extension}");
        ocrFilePath = Path.Combine(rootDirectory, $"{sourceRendition.Uri}.txt");
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
        // fetch document
        var extract = sourceRendition.GetExtractDocument();
        extract.FileName = Path.GetFileName(extractedFilePath);
        extract.DoExtract(Path.GetDirectoryName(extractedFilePath), true, false, "");
        if (!String.IsNullOrWhiteSpace(extract.FileName) && File.Exists(extractedFilePath)) {
            ocrFilePath = TextExtractor.ExtractFromFile(extractedFilePath, tessData);
            // use record extension method that removes existing OCR rendition (if exists)
            record.AddOcrRendition(ocrFilePath);
            record.Save();
            success = true;
        }
    }
    catch (Exception ex)
    {
    }
    finally
    {
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
    }
    return success;
}
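This method repeats most of the record-level version: build temp paths, extract, OCR, attach, clean up. If I ever wanted to share that flow, one option is to pass the three varying steps in as delegates. The sketch below is not what the solution does; OcrWorkflow and Run are made-up names, and the SDK calls would live inside the delegates supplied by the caller.

```csharp
using System;
using System.IO;

// Hypothetical helper: one extract -> OCR -> attach flow shared by records and renditions.
public static class OcrWorkflow
{
    // extract: writes the source document to the given path
    // ocr:     takes the extracted file path and returns the path of the OCR text file
    // attach:  stores the OCR text file back on the record
    public static bool Run(string workingDirectory, string key, string extension,
        Action<string> extract, Func<string, string> ocr, Action<string> attach)
    {
        string extractedFilePath = Path.Combine(workingDirectory, key + "." + extension);
        string ocrFilePath = string.Empty;
        try
        {
            Directory.CreateDirectory(workingDirectory);
            extract(extractedFilePath);
            if (!File.Exists(extractedFilePath)) return false;
            ocrFilePath = ocr(extractedFilePath);
            attach(ocrFilePath);
            return true;
        }
        catch (Exception)
        {
            // swallowed to mirror the behaviour of the original methods; logging would go here
            return false;
        }
        finally
        {
            DeleteIfExists(extractedFilePath);
            DeleteIfExists(ocrFilePath);
        }
    }

    private static void DeleteIfExists(string path)
    {
        if (!string.IsNullOrEmpty(path) && File.Exists(path)) File.Delete(path);
    }
}
```

The record version would pass a delegate that extracts the record's document, and the rendition version one that extracts the rendition; everything else stays identical.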

Duplicating code is never a great idea, I know. This is just for fun though, so I'm not going to stress about it. Now I hit compile and then register my event processor add-in, as shown below.

2017-11-14_21-09-31.png

I then enabled the configuration status and saved/deployed...

2017-11-14_21-10-24.png

Over in the client I removed the OCR rendition by using the custom button on my home ribbon...

2017-11-14_21-13-59.png

When I then monitor the event processor I can see something's been queued!

2017-11-14_21-11-55.png

A few minutes later I've got a new OCR rendition attached.

2017-11-14_21-17-24.png

Progress!  Next thing I need to do is train tesseract.  Many of these records are typed and not handwritten.  That means I should be able to create a set of trainer data that improves the confidence of the OCR text.  Additionally, I'd like to be able to compare the results from the original PDF and the tesseract results. 

Using Tesseract-OCR within the Client

In a previous post I showed how to generate OCR renditions via PowerShell. The process worked quite well, and the accuracy is higher than other solutions. After that post I went to upload the PowerShell scripts to GitHub and decided to re-run each script against a new dataset.

As I ran the OCR script I noticed a few things I did not like about it:

  1. The script ran fine for hours and then bombed because the search results went stale
  2. I must remember to run the script after each import of records, or the new records get no OCR renditions
  3. I had to create a custom property to track whether an OCR rendition was generated

To overcome these challenges I'll need to write some code.  Time to break out Visual Studio and build a new solution.  So let's dive right in!  


I opened up Microsoft Visual Studio 2017 and created a new solution with two projects: a C# class library for the add-in, and a C# class library for the Ocr functionality.  Here I'm splitting the Ocr functionality into a separate project because in the next post I'll create an event processor plug-in.  To make this work I updated the first project to reference the second and set a build dependency between the two.

Next I implemented the ITrimAddIn interface and organized the interface stubs into logical regions, as shown below.  I also created a folder named MenuLinks and created two new classes within: UpdateOcrRendition and RemoveOcrRendition.  Those classes will expose the menu options to the users within the client.

2017-11-14_8-03-16.png

The two menu link classes are defined as follows:

 
using HP.HPTRIM.SDK;
 
namespace CMRamble.Ocr.ClientAddin.MenuLinks
{
    public class UpdateOcrRendition : TrimMenuLink
    {
        public const int LINK_ID = 8002;
        public override int MenuID => LINK_ID;
        public override string Name => "Update Ocr Rendition";
        public override string Description => "Uses the document content to generate OCR text";
        public override bool SupportsTagged => true;
 
    }
}
 
 
using HP.HPTRIM.SDK;
namespace CMRamble.Ocr.ClientAddin.MenuLinks
{
    public class RemoveOcrRendition : TrimMenuLink
    {
        public const int LINK_ID = 8003;
        public override int MenuID => LINK_ID;
        public override string Name => "Remove Ocr Rendition";
        public override string Description => "Remove any Ocr Renditions";
        public override bool SupportsTagged => true;
    }
}
 

Now in the Add-in class I create a local variable to store the array of MenuLinks, update the Initialise interface stub to instantiate that array, and then force the GetMenuLinks method to return that array....

private TrimMenuLink[] links;
public override void Initialise(Database db)
{
    links = new TrimMenuLink[2] { new MenuLinks.UpdateOcrRendition(), new MenuLinks.RemoveOcrRendition() };
}
public override TrimMenuLink[] GetMenuLinks()
{
    return links;
}

Next up I need to complete the IsMenuItemEnabled method. I do this by switching on the command link ID passed into the method and comparing it to the constants that back my menu link IDs. If you look closely at the code below, you'll notice that I'm calling "HasOcrRendition" when the link matches my RemoveOcrRendition link. There is no such method in the out-of-the-box .Net SDK. Here I'll be calling a static extension method contained inside the other library. I'm doing this because I know I'll need that same capability (to know if there is an OCR rendition) across multiple libraries. It also makes the code easier to read.

public override bool IsMenuItemEnabled(int cmdId, TrimMainObject forObject)
{
    switch (cmdId)
    {
        case MenuLinks.UpdateOcrRendition.LINK_ID:
            return forObject.TrimType == BaseObjectTypes.Record && ((HP.HPTRIM.SDK.Record)forObject).IsElectronic;
        case MenuLinks.RemoveOcrRendition.LINK_ID:
            return forObject.TrimType == BaseObjectTypes.Record && ((Record)forObject).HasOcrRendition();
        default:
            return false;
    }
}

The last two methods I need to implement within my record add-in are named "ExecuteLink".  Here I'll hand the implementation details off to a static class contained within my second project.  Doing so makes this code easy to understand and even easier to maintain.

public override void ExecuteLink(int cmdId, TrimMainObject forObject, ref bool itemWasChanged)
{
    HP.HPTRIM.SDK.Record record = forObject as HP.HPTRIM.SDK.Record;
    if ((HP.HPTRIM.SDK.Record)record != null)
    {
        switch (cmdId)
        {
            case MenuLinks.UpdateOcrRendition.LINK_ID:
                RecordController.UpdateOcrRendition(record);
                break;
            case MenuLinks.RemoveOcrRendition.LINK_ID:
                RecordController.RemoveOcrRendition(record);
                break;
            default:
                break;
        }
    }
}
public override void ExecuteLink(int cmdId, TrimMainObjectSearch forTaggedObjects)
{
    switch (cmdId)
    {
        case MenuLinks.UpdateOcrRendition.LINK_ID:
            RecordController.UpdateOcrRenditions(forTaggedObjects);
            break;
        case MenuLinks.RemoveOcrRendition.LINK_ID:
            RecordController.RemoveOcrRenditions(forTaggedObjects);
            break;
        default:
            break;
    }
}

Now I need to build the desired functionality within the solution's second project. To start I'll go ahead and import the tesseract library via the NuGet package manager. As of this post the latest stable version was 3.0.2. Note that I also imported the CM .Net SDK and System.Drawing.

2017-11-14_8-21-48.png

Next I downloaded the latest english language data files and placed them into the required tessdata sub-folder.  I also updated the properties of each so that they copy to the output folder if needed.

2017-11-14_8-29-59.png

I decided to implement the remove OCR rendition feature first. One method works on a single record and a second method works on a set of tagged objects (the same approach as in the client add-in). To keep it super simple I'm not presenting any sort of user interface or options.

#region Remove Ocr Rendition
public static bool RemoveOcrRendition(Record record)
{
    return record.RemoveOcrRendition();
}
public static void RemoveOcrRenditions(TrimMainObjectSearch forTaggedObjects)
{
    foreach (var result in forTaggedObjects)
    {
        HP.HPTRIM.SDK.Record record = result as HP.HPTRIM.SDK.Record;
        if ((HP.HPTRIM.SDK.Record)record != null)
        {
            RemoveOcrRendition(record);
        }
    }
} 
#endregion

I again used an extension method, this time naming it "RemoveOcrRendition". I created a new class named "RecordExtensions", marked it static, and implemented the functionality. I also added one last extension method that handles the creation of a new OCR rendition. The contents of that class are included below.

using HP.HPTRIM.SDK;
namespace CMRamble.Ocr
{
    public static class RecordExtensions
    {
        public static void AddOcrRendition(this Record record, string fileName)
        {
            if (record.HasOcrRendition()) record.RemoveOcrRendition();
            record.ChildRenditions.NewRendition(fileName, RenditionType.Ocr, "Ocr");
        }
        public static bool RemoveOcrRendition(this Record record)
        {
            bool removed = false;
            for (uint i = 0; i < record.ChildRenditions.Count; i++)
            {
                RecordRendition rendition = record.ChildRenditions.getItem(i) as RecordRendition;
                if ((RecordRendition)rendition != null && rendition.TypeOfRendition == RenditionType.Ocr)
                {
                    rendition.Delete();
                    removed = true;
                }
            }
            record.Save();
            return removed;
        }
        public static bool HasOcrRendition(this Record record)
        {
            for (uint i = 0; i < record.ChildRenditions.Count; i++)
            {
                RecordRendition rendition = record.ChildRenditions.getItem(i) as RecordRendition;
                if ((RecordRendition)rendition != null && rendition.TypeOfRendition == RenditionType.Ocr)
                {
                    return true;
                }
            }
            return false;
        }
    }
}

Now that I have the remove ocr rendition functionality complete I can move onto the update functionality.  In order to OCR the file I must first extract it to disk.  Then I can extract the text by calling the tesseract library and saving the results back as a new ocr rendition.  The code below implements this within the Record Controller class (which is invoked by the addin).

#region Update Ocr Rendition
public static bool UpdateOcrRendition(Record record)
{
    bool success = false;
    string extractedFilePath = string.Empty;
    string ocrFilePath = string.Empty;
    try
    {
        // get a temp working location on disk
        var rootDirectory = Path.Combine(Path.GetTempPath(), "cmramble_ocr");
        if (!Directory.Exists(rootDirectory)) Directory.CreateDirectory(rootDirectory);
        // formulate file name to extract, delete if exists for some reason
        extractedFilePath = Path.Combine(rootDirectory, $"{record.Uri}.{record.Extension}");
        ocrFilePath = Path.Combine(rootDirectory, $"{record.Uri}.txt");
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
        // fetch document
        record.GetDocument(extractedFilePath, false, "OCR", string.Empty);
        // get the OCR text
        ocrFilePath = TextExtractor.ExtractFromFile(extractedFilePath);
        // use record extension method that removes existing OCR rendition (if exists)
        record.AddOcrRendition(ocrFilePath);
        record.Save();
        success = true;
    }
    catch (Exception ex)
    {
    }
    finally
    {
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
    }
    return success;
}
public static void UpdateOcrRenditions(TrimMainObjectSearch forTaggedObjects)
{
    foreach (var result in forTaggedObjects)
    {
        HP.HPTRIM.SDK.Record record = result as HP.HPTRIM.SDK.Record;
        if ((HP.HPTRIM.SDK.Record)record != null)
        {
            UpdateOcrRendition(record);
        }
    }
}
#endregion

I placed all of the tesseract logic into a new class named TextExtractor.  Within that class I have one method that takes a file name and returns the name of a file containing all of the ocr text.  If I use tesseract on a PDF though it will give me back the text layers from within the PDF, which defeats my goal.  I want tesseract to OCR the images within the PDF. 

To accomplish that I used the Xpdf command line utility pdftopng, which extracts all of the images to disk.  I then iterate over each image (just like I did within the original powershell script) to generate new OCR content.  As each image is processed the results are appended to an ocr text file.  That text file is what is returned to the record controller.

using CMRamble.Ocr.Util;
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using Tesseract;
namespace CMRamble.Ocr
{
    public static class TextExtractor
    {
        /// <summary>
        /// Exports all images from PDF and then runs OCR over each image, returning the name of the file on disk holding the OCR results
        /// </summary>
        /// <param name="filePath">Source file to be OCR'd</param>
        /// <returns>Name of file containing OCR contents</returns>
        public static string ExtractFromFile(string filePath)
        {
            var ocrFileName = string.Empty;
            var extension = Path.GetExtension(filePath).ToLower();
            if (extension.Equals(".pdf"))
            {   
                // must break out the original images within the PDF and then OCR those
                var localDirectory = Path.Combine(Path.GetDirectoryName(filePath), Path.GetFileNameWithoutExtension(filePath));
                ocrFileName = Path.Combine(Path.GetDirectoryName(filePath), Path.GetFileNameWithoutExtension(filePath) + ".txt");
                FileHelper.Delete(ocrFileName);
                // call xpdf util pdftopng passing PDF and location to place images
                Process p = new Process();
                p.StartInfo.UseShellExecute = false;
                p.StartInfo.RedirectStandardOutput = true;
                p.StartInfo.FileName = "pdftopng";
                p.StartInfo.Arguments = $"\"{filePath}\" \"{localDirectory}\"";
                p.Start();
                string output = p.StandardOutput.ReadToEnd();
                p.WaitForExit();
                // find all the images that were extracted
                var images = Directory.GetFiles(Directory.GetParent(localDirectory).FullName, "*.png").ToList();
                foreach (var image in images)
                {
                    // spin up an OCR engine and have it dump text to the OCR text file
                    using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
                    {
                        using (var img = Pix.LoadFromFile(image))
                        {
                            using (var page = engine.Process(img))
                            {
                                File.AppendAllText(ocrFileName, page.GetText() + Environment.NewLine);
                            }
                        }
                    }
                    // clean-up as we go along
                    File.Delete(image);
                }
            }
            return ocrFileName;
        }
    }
}
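One detail worth a sketch is how the extracted page images are picked up. The working folder is shared, so a "*.png" search can also return images belonging to another document, and Directory.GetFiles makes no promise about ordering. The helper below is my own illustration (PageImages and ForDocument are made-up names). It assumes pdftopng names its output root-NNNNNN.png with a zero-padded page number, as its documentation describes, so an ordinal sort on the file name is also page order.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: finds the page images that belong to one extracted PDF.
public static class PageImages
{
    // Returns the PNG files named "<rootName>-*.png" in the directory, sorted by file name.
    public static string[] ForDocument(string directory, string rootName)
    {
        return Directory.GetFiles(directory, rootName + "-*.png")
            .OrderBy(path => Path.GetFileName(path), StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }
}
```

In ExtractFromFile the root name would be the file name without its extension, which is the last segment of the localDirectory value passed to pdftopng.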

All done!  I can now compile the add-in and play with it.  First I added the menu links to my home ribbon.  As you can see below, clicking the remove ocr rendition link changes the number of renditions available.

2017-11-14_8-54-24.gif

Along the same line, if I click update ocr rendition then the number of renditions is increased...

2017-11-14_8-59-56.gif

In the next post I'll incorporate the same functionality within an event processor plugin, so that all records have their content OCR'd via tesseract.  

You can download the full source for this solution here: 

https://github.com/HPECM/Community/tree/master/CMRamble/Ocr