Generating Custom PDF Thumbnails

I noticed that many of the JFK documents have a metadata cover sheet.  As shown in the image below, more than half include one.  In my custom user interface I want to embed a thumbnail of the document when the user is perusing search results, and these cover sheets render my thumbnails useless.

[Image: 2017-11-03_6-46-23.png]

PowerShell to the rescue again!  First I crafted a script that simply extracts the first page of each PDF...

$sourceFolder = "F:\Dropbox\CMRamble\JFK\docs_done"
$pngRoot = "F:\Dropbox\CMRamble\JFK\docs_done\pngs\"
#process only the original PDFs, not the OCR renditions
$files = Get-ChildItem -Path $sourceFolder -Filter "*.pdf" | Where-Object { $_.Name.EndsWith('-ocr.pdf') -eq $false }
foreach ( $file in $files )
{
    $fileName = "$($sourceFolder)\$($file.Name)"
    #extract just the first page at 300 DPI
    &pdftopng -f 1 -l 1 -r 300 "$fileName" "$pngRoot"
    #move the extracted page up next to its source PDF
    Get-ChildItem -Path $pngRoot -Filter "*.png" | ForEach-Object {
        $newImageName = "$($sourceFolder)\$([System.IO.Path]::GetFileNameWithoutExtension($fileName))$($_.Name)"
        Move-Item $_.FullName $newImageName
    }
}

After it finished running I was able to compare the documents.  Some of the documents do not have cover sheets.  It would be quickest to use the second page from every document, but for those without cover sheets the first page is usually the better representation.

Example document without cover sheet -- I would rather see the first page in the thumbnail

Example document with cover sheet -- I would rather see the second page in the thumbnail

I found that most of the cover sheets were 100KB or less in size.  Although I'm sure a few documents without cover sheets might have a first page smaller than that, I'm comfortable using it as a dividing line.  I'll update the PowerShell to process the second page if the first is too small, as sketched below.

[Image: 2017-11-03_6-56-42.png]
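
Here's a minimal sketch of that updated loop (reusing the $sourceFolder, $pngRoot, and $files variables from the first script, with 102400 bytes as my 100KB dividing line):

foreach ( $file in $files )
{
    $fileName = "$($sourceFolder)\$($file.Name)"
    #extract the first page and check its size
    &pdftopng -f 1 -l 1 -r 300 "$fileName" "$pngRoot"
    Get-ChildItem -Path $pngRoot -Filter "*.png" | ForEach-Object {
        if ( $_.Length -le 102400 ) {
            #probably a cover sheet: toss it and take the second page instead
            Remove-Item $_.FullName -Force
            &pdftopng -f 2 -l 2 -r 300 "$fileName" "$pngRoot"
        }
    }
    #move whichever page survived up next to its source PDF
    Get-ChildItem -Path $pngRoot -Filter "*.png" | ForEach-Object {
        $newImageName = "$($sourceFolder)\$([System.IO.Path]::GetFileNameWithoutExtension($fileName))$($_.Name)"
        Move-Item $_.FullName $newImageName
    }
}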

After I run it I get these results...

Thumbnails selected based on extracted image size (cover sheets are mostly white space and therefore smaller in size)

Sweet!  The last thing I need is a PowerShell script that I can run against my Content Manager records.  It will find all PDFs, extract the original rendition, apply this thumbnail logic, and then attach a thumbnail rendition.  My user interface can then link directly to the thumbnail rendition.

Clear-Host
Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
$tmpFolder = "$([System.IO.Path]::GetTempPath())\cmramble"
if ( (Test-Path $tmpFolder) -eq $false ) { New-Item -Path $tmpFolder -ItemType Directory }
Write-Progress -Activity "Generating PDF Thumbnails" -Status "Searching for Records" -PercentComplete 0
$records = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch -ArgumentList $db, Record
$records.SearchString  = "extension:pdf"
$x = 0
foreach ( $result in $records ) {
    $x++
    $record = [HP.HPTRIM.SDK.Record]$result
    Write-Progress -Activity "Generating PDF Thumbnails" -Status "Record # $($record.Number)" -PercentComplete (($x/$records.Count)*100)
    $hasThumbnail = $false
    $localFileName = $null
    for ( $i = 0; $i -lt $record.ChildRenditions.Count; $i++ ) {
        $rendition = $record.ChildRenditions.getItem($i)
        if ( $rendition.TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Thumbnail ) {
            #$rendition.Delete()
            #$record.Save()
            $hasThumbnail = $true
        } elseif ( $rendition.TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Original ) {
            $extract = $rendition.GetExtractDocument()
            $extract.FileName = "$($record.Uri).pdf"
            $extract.DoExtract("$($tmpFolder)", $true, $false, $null)
            $localFileName = "$($tmpFolder)\$($record.Uri).pdf"
        }
    }
    #generate a thumbnail only when one doesn't already exist and an original PDF was extracted
    if ( ($hasThumbnail -eq $false) -and ([String]::IsNullOrWhiteSpace($localFileName) -eq $false) -and (Test-Path $localFileName)) {
        #get a storage spot for the image(s)
        $pngRoot = "$($tmpFolder)\$($record.Uri)\"
        if ( (Test-Path $pngRoot) -eq $false ) { New-Item -ItemType Directory -Path $pngRoot | Out-Null }
        #extract the first image
        &pdftopng -f 1 -l 1 -r 300 "$localFileName" "$pngRoot" 2>&1 | Out-Null
        $firstPages = Get-ChildItem -Path $pngRoot -Filter "*.png"
        foreach ( $firstPage in $firstPages ) {
            if ( $firstPage.Length -le 102400 ) {
                #first page is small enough to be a cover sheet, so use the second page
                Remove-Item $firstPage.FullName -Force
                &pdftopng -f 2 -l 2 -r 300 "$localFileName" "$pngRoot"
                $secondPages = Get-ChildItem -Path $pngRoot -Filter "*.png"
                foreach ( $secondPage in $secondPages )
                {
                    $record.ChildRenditions.NewRendition($secondPage.FullName, [HP.HPTRIM.SDK.RenditionType]::Thumbnail, "PSImaging PNG") | Out-Null
                    $record.Save()
                    Remove-Item $secondPage.FullName -Force
                }
            } else {
                #use the first page
                $record.ChildRenditions.NewRendition($firstPage.FullName, [HP.HPTRIM.SDK.RenditionType]::Thumbnail, "PSImaging PNG") | Out-Null
                $record.Save()
                Remove-Item $firstPage.FullName -Force
            }
        }
        }
        Remove-Item $pngRoot -Recurse -Force
        Write-Host "Generated Thumbnail for $($record.Number)"
    } else {
        Write-Host "Skipped $($record.Number)"
    }
    if ( ($localFileName -ne $null) -and (Test-Path $localFileName) ) { Remove-Item $localFileName }
}

If I visit my user interface I can see the results first-hand:

[Image: 2017-11-03_7-20-45.png]

Generating Keywords for the JFK Archive

To recap where I'm at in this current series (JFK Archives):

  1. I've imported 6,685 records into Content Manager.
  2. I've OCR'd each of the PDFs I downloaded from NARA using Adobe Acrobat Pro.
  3. I've attached the OCR'd PDF as the main document and added two renditions: the original PDF and an OCR plain text file.
  4. For audio records the original rendition is the recording and the PDF becomes a "transcript" (re-captioned Rendition Type of Other1).

Now I'm exploring my results.  When using the standard thick client I've got to constantly click back and forth between different tabs in the view pane.  That approach works well in many other types of solutions, but for a "reading room" (of sorts) this archive is going to be challenging to work with. 

[Image: 2017-11-02_5-38-40.gif]

I wish the view pane were split horizontally so that I could have both the preview and the properties.  Notice that when I preview the PDFs I cannot select any of the text that was OCR'd.  I don't truly need to be able to select those words; but if I could, I would know that the selectable text is searchable.

[Image: 2017-11-02_5-50-26.png]

When I used Adobe Acrobat Pro to OCR these documents I left the default "Searchable Image" output option.  I could have selected "Editable Text and Images" instead, which would have generated searchable PDFs where the text could also be selected.

It took almost a full day for one computer to generate the OCR'd PDFs, and it crashed 3 times along the way.  Adobe gobbles up memory until it ultimately crashes, though I could always pick up where it left off.  The thought of going through that process again isn't appealing.

If my goal is to be able to see the OCR'd text, I can take a look at the OCR text rendition.  Unfortunately this requires me to open the properties dialog, click over to the renditions tab, find the rendition, and select it.  Plus I'm finding the contents less and less appealing with each viewing.

I don't think that is English

The quality of the original scans isn't all that great, and the text is often illegible.  Why can't I just strip out all of the noise and maybe generate a count of unique words?  I think I'll take a stab at extracting all of the unique words from the OCR'd text, sorting them by most frequently used first, and then saving the results into a new property named "NARA Content Keywords".

Words are ordered from most frequent to least frequent

I think I should exclude all "words" that are just a single character.  I see a few underscores, so I'll need to filter those out as well.  Might as well add a little more spacing between the words while I'm at it; the counting-and-filtering logic is sketched after the next screenshot.

[Image: 2017-11-02_7-33-19.png]
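
Here's a minimal sketch of that pass, assuming the OCR text is already sitting in a local file (the path is a placeholder; the full Content Manager version appears at the end of this post):

#count word frequencies in an OCR text file, skipping single-character
#tokens and anything containing an underscore (\w matches underscores)
$regex = [regex]"(\w+)"
$words = @{}
Get-Content "C:\temp\sample-ocr.txt" | ForEach-Object {
    foreach ( $match in $regex.Matches($_) ) {
        $word = $match.Value
        if ( ($word.Length -gt 1) -and ($word.Contains('_') -eq $false) ) {
            $words[$word] = $words[$word] + 1
        }
    }
}
#most frequent first, joined with two spaces for readability
($words.GetEnumerator() | Sort-Object Value -Descending | ForEach-Object { $_.Name }) -join '  '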

Each of these words is content searchable.  I picked a weird word, "otential" (which probably should have been "potential"), and sure enough it came back in a content search.  Some of these words should be excluded as noise, so I filter those out and now have something more interesting to review.

[Image: 2017-11-02_7-44-12.png]

As I ran my PowerShell script against all of my records I noticed another issue: Adobe Acrobat sometimes generated a 0 byte OCR text file.  When I opened Adobe's output there were no words in the "OCR'd PDF" either.  Talk about frustrating!

[Image: 2017-11-02_8-56-18.png]

I can still get what I need though; I just need to be a bit more creative.  Since I'm in PowerShell and already working with these records, I can leverage two free, open-source tools: pdftopng and PSImaging.  I can use pdftopng to extract every page of a PDF as an image and PSImaging to extract the text from those images.  Then I can organize my words and save them back to the record.  I'll create a second field so that I can compare the differences.

File system activity from the PowerShell script: extract PDF, extract images, push OCR into txt
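
In isolation, that two-step pipeline looks something like this (a sketch with placeholder paths; Export-ImageText is the PSImaging cmdlet used the same way in the full script below):

#render every page of one PDF to 300 DPI PNGs
New-Item -ItemType Directory -Path "C:\temp\pages" -Force | Out-Null
&pdftopng -r 300 "C:\temp\sample.pdf" "C:\temp\pages\"
#OCR each page image and append its text to a single txt file
Get-ChildItem -Path "C:\temp\pages" -Filter "*.png" | ForEach-Object {
    Export-ImageText $_.FullName | Add-Content "C:\temp\sample-ocr.txt"
}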

If I look at these word lists within Content Manager, I can see a stark difference.  I still need to do some refining of my noise list, but I'm excited to see useful keywords surfacing from this effort.  One main reason is that PSImaging leverages tesseract-OCR (from Google), which, in my experience, produces better OCR results.

[Image: 2017-11-02_9-32-33.png]

Here's the PowerShell I used to generate this last round of keywords.

Clear-Host
Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
Write-Progress -Activity "Generating OCR Keywords" -Status "Loading" -PercentComplete 0
#prep a temp spot to store OCR text files
$tmpFolder = "$([System.IO.Path]::GetTempPath())\cmramble"
if ( (Test-Path $tmpFolder) -eq $false ) { New-Item -Path $tmpFolder -ItemType Directory }
#prep word collection and word regex
$regex = [regex]"(\w+)"
$noiseWords = @("the", "to", "and", "subject", "or", "of", "is", "in", "be", "he", "that", "with", "was", "on", "have", "had", "as", "has", "at", "but", "no", "his", "these", "from", "any", "there")
$records = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch $db, Record
$records.SearchString = "electronic"
$x = 0
foreach ( $result in $records ) 
{
    $x++
    $record = [HP.HPTRIM.SDK.Record]$result
    Write-Progress -Activity "Generating OCR Keywords" -Status "$($record.Number)" -PercentComplete (($x/$records.Count)*100)
    #fetch the record
    if ( $record -ne $null ) 
    {
	    for ( $i = 0; $i -lt $record.ChildRenditions.Count; $i++ ) 
	    {
		    $rendition = $record.ChildRenditions.getItem($i)
		    #find original rendition
		    if ( $rendition.TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Original ) 
		    {
                $words = [ordered]@{}
			    #extract it
			    #extract the original PDF to the temp folder
			    $extract = $rendition.GetExtractDocument()
			    $extract.FileName = "$($record.Uri).pdf"
			    $extract.DoExtract("$($tmpFolder)", $true, $false, $null)
			    $localFileName = "$($tmpFolder)\$($record.Uri).pdf"
                #get a storage spot for the image(s)
                $pngRoot = "$($tmpFolder)\$($record.Uri)\"
                if ( (Test-Path $pngRoot) -eq $false ) { New-Item -ItemType Directory -Path $pngRoot | Out-Null }
                #extract images
                &pdftopng -r 300 "$localFileName" "$pngRoot" 2>&1 | Out-Null
                #generate OCR from each image
                $ocrTxt = "$([System.IO.Path]::GetDirectoryName($pngRoot))\$($record.Uri).txt"
                Get-ChildItem $pngRoot | ForEach-Object {
                    Export-ImageText $_.FullName | Add-Content $ocrTxt
                    Remove-Item $_.FullName -Force
                }
                Get-Content $ocrTxt | Where-Object {$_ -match $regex} | ForEach-Object {
				    $matches = $regex.Matches($_)
				    foreach ( $match in $matches ) 
				    {
                        if ( ($match -ne $null) -and ($match.Value.Length -gt 1) -and ($match.Value.Contains('_') -eq $false) -and ($noiseWords.Contains($match.Value.ToLower()) -eq $false) ) {
					        if ( $words.Contains($match.Value) ) 
					        {
						        $words[$match.Value]++
					        } else {
						        $words.Add($match.Value, 1)
					        }
                        }
				    }
			    }
			    #reorder words
			    $words = $words.GetEnumerator() | Sort-Object Value -Descending
			    $wordText = ''
			    #generate string of just the words (no counts)
			    $words | ForEach-Object { $wordText += ($_.Name + '  ') }
			    #stuff into CM
			    $record.SetFieldValue($db.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, "NARA OCR Keywords"), (New-Object HP.HPTRIM.SDK.UserFieldValue($wordText)))
				#replace the OCR txt rendition: first remove any existing OCR renditions,
				#iterating backwards with a fresh counter so deletions don't skip items
				#(and don't clobber the outer loop's $i)
				for ( $j = $record.ChildRenditions.Count - 1; $j -ge 0; $j-- )
				{
					$rendition = $record.ChildRenditions.getItem($j)
					if ( $rendition.TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Ocr )
					{
						$rendition.Delete()
					}
				}
				$record.ChildRenditions.NewRendition($ocrTxt, [HP.HPTRIM.SDK.RenditionType]::Ocr, "OCR") | Out-Null
				$record.Save() | Out-Null
                Remove-Item $ocrTxt -Force
                Remove-Item $pngRoot -Recurse -Force
                Remove-Item -Path $localFileName
		    }
	    }
    }
}