Generating Keywords for the JFK Archive

To recap where I'm at in this current series (JFK Archives):

  1. I've imported 6685 records into Content Manager.
  2. I've OCR'd each of the PDFs I downloaded from NARA using Adobe Acrobat Pro.
  3. I've attached the OCR'd PDF as the main document and added two renditions: the original PDF and an OCR plain-text file.
  4. For audio records the original rendition is the recording and the PDF becomes a "transcript" (a re-captioned Rendition Type of Other1).

Now I'm exploring my results.  When using the standard thick client I have to constantly click back and forth between different tabs in the view pane.  That approach works well in many other types of solutions, but for a "reading room" (of sorts) this archive is going to be challenging to work with. 


I wish the view pane were split horizontally so that I could have both the preview and the properties.  Notice that when I preview the PDFs I cannot select any of the text that was OCR'd.  I don't truly need to select those words; but if I could, I would know that the text is searchable.


When I used Adobe Acrobat Pro to OCR these documents I left the default "Searchable Image" output option.  I could have selected "Editable Text and Images" instead, which would have generated searchable PDFs where the text can be selected. 

It took almost a full day for one computer to generate the OCR'd PDFs, and Acrobat crashed three times along the way.  Adobe gobbles up memory until it ultimately crashes, though I could pick up where it left off each time.  The thought of going through this process again isn't appealing.

If my goal is to be able to see the OCR'd text, I can take a look at the OCR text rendition.  Unfortunately this requires me to open the properties dialog, click over to the renditions tab, find the rendition, and select it.  Plus I'm finding the contents less and less appealing with each viewing.

I don't think that is English


The quality of the original scans isn't all that great, and the text is often illegible.  Why can't I just strip out all of the noise and maybe generate a count of unique words?  I think I'll take a stab at extracting all of the unique words from the OCR'd text, sorting by most frequently used first, and then saving the results into a new property named "NARA Content Keywords".  
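The core of that idea is just a frequency tally over regex word matches.  Here's a minimal sketch (the sample text is made up for illustration):

```powershell
# Tally unique words in a blob of OCR text, then list them most frequent first.
$ocrText = "release of the report and release of records"
$words = @{}
foreach ( $match in [regex]::Matches($ocrText, "(\w+)") ) {
    if ( $words.ContainsKey($match.Value) ) { $words[$match.Value]++ }
    else { $words[$match.Value] = 1 }
}
$words.GetEnumerator() | Sort-Object Value -Descending |
    ForEach-Object { "$($_.Key) $($_.Value)" }
```

The same hashtable-plus-sort pattern shows up in the full script further down; this is just the idea in isolation.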

Words are ordered from most frequent to least frequent


I think I should exclude all "words" that are just a single character.  I also see a few underscores, so I'll need to filter those out as well.  Might as well add a little more spacing between the words while I'm at it.


Each of these words is content searchable.  I picked a weird word, "otential" (probably should have been "potential"), and sure enough it came back in a content search.  Some of these words are noise and should be excluded, so I filter those out and now have something more interesting to review.
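Putting the filters together (single characters, underscores, noise words) looks something like this; the word counts and the tiny noise list here are made up for illustration:

```powershell
# $words maps word -> count (sample data, not real OCR output)
$words = @{ "oswald" = 12; "the" = 40; "_" = 3; "a" = 9; "otential" = 1 }
$noiseWords = @("the", "to", "and", "of")
$words.GetEnumerator() |
    Where-Object { $_.Key.Length -gt 1 -and
                   $_.Key -notmatch "_" -and
                   ($noiseWords -notcontains $_.Key.ToLower()) } |
    Sort-Object Value -Descending |
    ForEach-Object { $_.Key }
# leaves "oswald" and "otential"
```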


As I ran my PowerShell script to do this on all of my records I noticed another issue: Adobe Acrobat generated a 0-byte OCR text file.  When I opened the output from Adobe there were no words in the "OCR'd PDF" either.  Talk about frustrating!  


I can still get what I need though; I just need to be a bit more creative.  Since I'm in PowerShell and working with these records, I can leverage two free, open-source tools: pdftopng and PSImaging.  I can use pdftopng to render each page of a PDF as an image and PSImaging to extract the text from those images.  Then I can organize my words and save them back onto the record.  I'll create a second field so that I can compare the differences.
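Sketched on its own, the fallback pipeline is short.  This assumes pdftopng (from the Xpdf command-line tools) is on the PATH, PSImaging's Export-ImageText cmdlet is available, and the paths are hypothetical:

```powershell
$pdf     = "C:\temp\record.pdf"    # hypothetical input PDF
$pngRoot = "C:\temp\pages\"        # one PNG per page lands here
#render each page as a 300 DPI image
&pdftopng -r 300 $pdf $pngRoot
#OCR each page image and append the text to one file
Get-ChildItem $pngRoot -Filter *.png | ForEach-Object {
    Export-ImageText $_.FullName | Add-Content "C:\temp\record.txt"
}
```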

File system activity from the PowerShell script: extract PDF, extract images, push OCR into txt


If I look at these word lists within Content Manager, I can see a stark difference.  I still need to do some refining on my noise list, but I'm excited to see useful keywords surfacing from this effort.  One main reason is that PSImaging leverages Tesseract OCR (from Google) which, in my experience, gives better OCR results.


Here's the PowerShell I used to generate this last round of keywords.

Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
Write-Progress -Activity "Generating OCR Keywords" -Status "Loading" -PercentComplete 0
#prep a temp spot to store OCR text files
$tmpFolder = "$([System.IO.Path]::GetTempPath())\cmramble"
if ( (Test-Path $tmpFolder) -eq $false ) { New-Item -Path $tmpFolder -ItemType Directory | Out-Null }
#prep word collection and word regex
$regex = [regex]"(\w+)"
$noiseWords = @("the", "to", "and", "subject", "or", "of", "is", "in", "be", "he", "that", "with", "was", "on", "have", "had", "as", "has", "at", "but", "no", "his", "these", "from", "any", "there")
$records = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch $db, Record
$records.SearchString = "electronic"
$x = 0
foreach ( $result in $records ) {
    $record = [HP.HPTRIM.SDK.Record]$result
    $x++
    Write-Progress -Activity "Generating OCR Keywords" -Status "$($record.Number)" -PercentComplete (($x/$records.Count)*100)
    if ( $record -eq $null ) { continue }
    for ( $i = 0; $i -lt $record.ChildRenditions.Count; $i++ ) {
        $rendition = $record.ChildRenditions.getItem($i)
        #only process the original rendition
        if ( $rendition.TypeOfRendition -ne [HP.HPTRIM.SDK.RenditionType]::Original ) { continue }
        $words = [ordered]@{}
        #extract the original PDF to the temp folder
        $extract = $rendition.GetExtractDocument()
        $extract.FileName = "$($record.Uri).pdf"
        $extract.DoExtract("$($tmpFolder)", $true, $false, $null)
        $localFileName = "$($tmpFolder)\$($record.Uri).pdf"
        #get a storage spot for the image(s)
        $pngRoot = "$($tmpFolder)\$($record.Uri)\"
        if ( (Test-Path $pngRoot) -eq $false ) { New-Item -ItemType Directory -Path $pngRoot | Out-Null }
        #render each page as a PNG
        &pdftopng -r 300 "$localFileName" "$pngRoot" 2>&1 | Out-Null
        #generate OCR text from each image, then discard the image
        $ocrTxt = "$([System.IO.Path]::GetDirectoryName($pngRoot))\$($record.Uri).txt"
        Get-ChildItem $pngRoot | ForEach-Object {
            Export-ImageText $_.FullName | Add-Content $ocrTxt
            Remove-Item $_.FullName -Force
        }
        #tally unique words, skipping single characters and noise words
        Get-Content $ocrTxt | Where-Object { $_ -match $regex } | ForEach-Object {
            foreach ( $match in $regex.Matches($_) ) {
                if ( ($match -ne $null) -and ($match.Value.Length -gt 1) -and ($noiseWords.Contains($match.Value.ToLower()) -eq $false) ) {
                    if ( $words.Contains($match.Value) ) {
                        $words[$match.Value]++
                    } else {
                        $words.Add($match.Value, 1)
                    }
                }
            }
        }
        #reorder words from most to least frequent
        $words = $words.GetEnumerator() | Sort-Object Value -Descending
        $wordText = ''
        #generate string of just the words (no counts)
        $words | ForEach-Object { $wordText += ($_.Name + '  ') }
        #stuff into CM
        $record.SetFieldValue($db.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, "NARA OCR Keywords"), (New-Object HP.HPTRIM.SDK.UserFieldValue($wordText)))
        #remove any existing OCR rendition, then attach the new one
        for ( $j = $record.ChildRenditions.Count - 1; $j -ge 0; $j-- ) {
            if ( $record.ChildRenditions.getItem($j).TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Ocr ) {
                $record.ChildRenditions.getItem($j).Delete()
            }
        }
        $record.ChildRenditions.NewRendition($ocrTxt, [HP.HPTRIM.SDK.RenditionType]::Ocr, "OCR") | Out-Null
        $record.Save() | Out-Null
        #clean up temp files
        Remove-Item $ocrTxt -Force
        Remove-Item $pngRoot -Force
        Remove-Item -Path $localFileName
    }
}