Piggy-backing a Modern Analytics Architecture

Let’s say you’re one of the many Federal Agencies, Regulated Industries, or State & Local Governments migrating their analytical infrastructure onto the secured GCP platform (those ISO 27001, ISO 27017, ISO 27018, FedRAMP, HIPAA, and GDPR certifications make this a no-brainer!). You will soon have access to tools and advanced capabilities that enable a more comprehensive use of Content Manager’s audit, compliance, and governance features. More importantly, you can start leveraging technology investments driven by departments & cost centers that have significantly higher budgets than your Records Management Office may have.

Consider this simple architectural pattern:

Automated Audit & Compliance.png

I’m actively working with two large insurance companies that are in the process of implementing this exact architecture. For each of them it’s a multi-year, multi-million dollar, multi-department modernization effort that will both migrate their analytical infrastructure and transform how they work. There are far more components in their final architecture, but I’ve simplified the diagram to keep things clear in this blog post.

I find it exciting that both customers agreed that governance was one (of many) priority logical capabilities. To them, initially, governance is primarily focused on data quality, master data, loss prevention, and security. Content Manager brings a lot to the governance table though. By piggy-backing on the analytical infrastructure we can leverage existing investments and tap into entirely new technologies. All we need to do is think outside of the on-premises mental box.

Regulatory Compliance Use Cases

Two key aspects of regulatory compliance are records management and business process integration. Once you move into a modern analytical architecture though, you no longer need to directly integrate Content Manager with the source systems. Now you can tap into data as it flows through the architecture.

Using a SaaS product hosted on AWS that publishes to a Kinesis stream? Not a problem! We can use a Lambda to immediately create a record in Content Manager so that compliance officers have immediate access to records!

Automated Audit & Compliance-Compliance (2).png

Or maybe you want to ensure record holds are annotated in the Warehouse in real-time? Also, not a problem! Apache Beam, the open-source framework backing Dataflow, makes it easy to pull in active record holds as a side input via the CM Service API.
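To make the record-hold idea a little more concrete, here is a minimal sketch of the per-row lookup that a side input makes possible, written in plain Python rather than Beam. In a real Dataflow job the set of held record ids would arrive as a side input fetched from the CM Service API; the field names `record_id` and `on_hold` are my own placeholders, not anything Content Manager defines.

```python
from typing import Dict, Iterable, Iterator, Set


def annotate_with_holds(rows: Iterable[Dict], held_record_ids: Iterable[int]) -> Iterator[Dict]:
    """Yield a copy of each row with an 'on_hold' flag added.

    held_record_ids stands in for the side input: a small, slowly changing
    collection that every row is checked against.
    """
    held: Set[int] = set(held_record_ids)  # set gives constant-time membership checks
    for row in rows:
        annotated = dict(row)  # copy so the incoming row is left untouched
        annotated["on_hold"] = row.get("record_id") in held
        yield annotated


if __name__ == "__main__":
    warehouse_rows = [{"record_id": 1}, {"record_id": 2}]
    for annotated_row in annotate_with_holds(warehouse_rows, held_record_ids=[2]):
        print(annotated_row)
```

The same function body could sit inside a Beam `DoFn` or `Map` step; only the way the hold list is delivered would change.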

Automated Audit & Compliance-Compliance - Record Holds (1).png

Or maybe you need to push Content Manager data into your environment so that you can mitigate operational risk by extracting entities & sentiment from textual underwriting notes (stored in CM)? GCP makes it simple! We just need to create a Cloud Composer DAG that pulls the content from CM, runs it through AutoML, submits a job to an auto-provisioned Hadoop cluster, runs the PySpark model to predict the propensity that a new policy will file a claim, submits the results to BigQuery for retrospective analysis, and automatically kicks off workflows for internal auditors to evaluate a random sampling of the model results!

Automated Audit & Compliance-Compliance - Entity Extraction (2).png

By just cobbling components together we can leverage the existing investment in the analytical infrastructure to create compelling Content Manager integrations that drive regulatory compliance. This can be a compelling low-code approach that delivers immediate results from your cloud investment. We’re just scratching the surface of the exciting things on the horizon!

Loosely Coupled Record Automation

I want to automate the creation of records in Content Manager so that workflows can be initiated and other content can be captured. As records and workflows are processed by users within Content Manager I want to export that activity to a graph database. That graph database will highlight relationships between records that might not otherwise be easily determined. It can also aid in tracing lineage and relationships.

A few years ago I would have created a lightweight C# console app that implements the logic and directly integrates with Content Manager via the .Net or COM SDK. No longer though! Now I want to implement this using as many managed services and serverless components as possible.

This diagram depicts the final solution design…

CM Graph (2).png

What does each component do?

  • Cloud Scheduler — an online cron/task utility that is cheap and easy to use on GCP

  • Cloud Functions — light-weight, containerized bundles of code that can be versioned, deployed, and managed independently of the other components

  • Cloud PubSub — this is a message broker service that allows you to quickly integrate software components together. One system may publish to a topic. Other systems (0+) will subscribe to those topics

  • Service API — REST API end-point that enables integration over HTTP

  • Content Manager Event Server — custom .Net Event Processor Plugin that publishes new record meta-data and workflow state to a PubSub topic

  • Graph Database — enables searching via Cypher query syntax (think social network graph) across complex relationships

Why use this approach?

  • Centralized — Putting the scheduler outside of the CM server makes it easier to monitor centrally

  • Separation of Concerns — Separating the “Check Website” logic from the “Saving to CM” logic enables us to re-use the logic for other purposes

  • Asynchronous Processing — Putting PubSub between the functions lets them react in real-time and independently of each other

  • Scaling — cloud functions and pubsub can scale horizontally to billions of calls

  • Error handling — when errors happen in a function we can redirect to an error topic for review (which could kick-off a workflow)

  • Language Freedom — I can use python, node, or Go for the cloud functions; or I can use .Net (via Cloud Run instead of Cloud Functions)
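The error-handling point is probably the easiest one to sketch. Below is one possible shape for it: a wrapper that catches anything a handler throws and republishes the original message, plus the error details, to an error topic. The `publish` callable and the `function_errors` topic name are placeholders I made up for illustration; in practice `publish` would wrap the Pub/Sub client.

```python
import json
import traceback
from typing import Callable


def with_error_topic(handler: Callable[[dict], None],
                     publish: Callable[[str, bytes], None],
                     error_topic: str = "function_errors") -> Callable[[dict], None]:
    """Wrap a handler so failures are redirected to an error topic for review."""

    def wrapped(message: dict) -> None:
        try:
            handler(message)
        except Exception as exc:
            # Keep the original message so a reviewer (or a workflow) can replay it
            payload = {
                "error": repr(exc),
                "trace": traceback.format_exc(),
                "original_message": message,
            }
            publish(error_topic, json.dumps(payload).encode("utf-8"))

    return wrapped


if __name__ == "__main__":
    def save_to_cm(message: dict) -> None:
        raise RuntimeError("CM is unreachable")  # simulate a failure

    safe_handler = with_error_topic(save_to_cm, lambda topic, data: print(topic, data))
    safe_handler({"firm_name": "Example Firm"})
```

Injecting `publish` keeps the wrapper testable without any cloud dependency, which fits the separation-of-concerns argument above.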

Overall this is a pretty simple undertaking. It will grow much more complex as time progresses, but for now I can get building!

Fetching the records

This is super easy with python! My source is a REST API that will contain a bunch of data about firms. For each retrieved firm I’ll publish a message to a topic. Multiple things could then subscribe to that topic and react to the message.

First we’ll create the topic…
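If you would rather script this step, the topic can also be created from python. This is only a sketch: it assumes a 2.x release of the google-cloud-pubsub library (where `create_topic` takes a `request` dictionary) and credentials that are allowed to create topics. The helper names are mine.

```python
def build_topic_path(project_id: str, topic_name: str) -> str:
    """Return the fully qualified topic name that Pub/Sub expects."""
    return "projects/{}/topics/{}".format(project_id, topic_name)


def create_topic(project_id: str, topic_name: str):
    """Create the topic and return the resulting Topic object."""
    # Imported lazily so the path helper above works without the library installed
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    return publisher.create_topic(request={"name": build_topic_path(project_id, topic_name)})


if __name__ == "__main__":
    # Only prints the path; call create_topic(...) once credentials are configured
    print(build_topic_path("CM-DEV", "new_firm"))
```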


Next I write the logic in a python module…

import urllib.request as urllib2
import json
import requests
import gzip
import os
from google.cloud import pubsub

# Fall back to development defaults when the run-time parameters are not set
project_id = os.getenv('GOOGLE_CLOUD_PROJECT') if os.getenv('GOOGLE_CLOUD_PROJECT') else 'CM-DEV'
topic_name = os.getenv('GOOGLE_CLOUD_TOPIC') if os.getenv('GOOGLE_CLOUD_TOPIC') else 'new_firm'

def callback(message_future):
    # When timeout is unspecified, the exception method waits indefinitely.
    if message_future.exception(timeout=30):
        print('Publishing message on {} threw an Exception {}.'.format(
            topic_name, message_future.exception()))

def downloadFirms(args):
    # Start from the default headers (which ask for gzip encoding) and add a browser User-Agent
    request_headers = requests.utils.default_headers()
    request_headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    })
    url = "https://api..."
    try:
        request = urllib2.Request(url, headers=request_headers)
        # The response body is gzip-compressed JSON
        html = gzip.decompress(urllib2.urlopen(request).read()).decode('utf-8')
        data = json.loads(html)
        hits = data['hits']['hits']
        publisher = pubsub.PublisherClient()
        topic_path = publisher.topic_path(project_id, topic_name)
        # Publish one message per firm so subscribers can react to each independently
        for firm in hits:
            firm_data = firm['_source']
            firm_name = firm_data['firm_name']
            topic_message = json.dumps(firm_data).encode('utf-8')
            msg = publisher.publish(topic_path, topic_message)
            # Report publish failures through the callback defined above
            msg.add_done_callback(callback)
    except Exception as exc:
        print('Error: ', exc)

if __name__ == "__main__":
    # Allows a quick local test outside of the cloud function
    downloadFirms(None)

Now that module can be placed into a cloud function…
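An HTTP-triggered python cloud function receives a Flask request object and is expected to return a response. A thin wrapper like the one below is one way to bridge that gap around `downloadFirms`; the wrapper and its names are my own illustration, so treat it as a sketch rather than the exact code deployed here.

```python
from typing import Callable, Tuple


def make_http_entry_point(download: Callable[[object], None]) -> Callable[[object], Tuple[str, int]]:
    """Build an HTTP entry point that runs the download and reports the outcome."""

    def entry_point(request) -> Tuple[str, int]:
        try:
            download(request)
        except Exception as exc:
            # A non-2xx status lets the caller (for example the scheduler) see the failure
            return ("Download failed: {}".format(exc), 500)
        return ("OK", 200)

    return entry_point


if __name__ == "__main__":
    # Stand-in for downloadFirms so the sketch runs without network access
    entry_point = make_http_entry_point(lambda request: None)
    print(entry_point(None))
```

Returning a `(body, status)` tuple is accepted by Flask, so nothing else needs to be imported for the simple case.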


Don’t forget to pass in the run-time parameters so that the function can post to the correct topic in the correct project. You may change these during your testing process.
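Those run-time parameters surface inside the function as environment variables. Here is a small sketch of reading them with a fallback, mirroring the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_TOPIC` lookups in the module; the helper name `resolve_setting` is hypothetical.

```python
import os
from typing import Mapping


def resolve_setting(name: str, default: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the named environment variable, or the default when unset or empty."""
    value = environ.get(name)
    return value if value else default


if __name__ == "__main__":
    project_id = resolve_setting("GOOGLE_CLOUD_PROJECT", "CM-DEV")
    topic_name = resolve_setting("GOOGLE_CLOUD_TOPIC", "new_firm")
    print(project_id, topic_name)
```

Passing `environ` explicitly makes it easy to try different values during testing without touching the real environment.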


With that saved we can now review the HTTP end-point address, which we’ll use when scheduling the routine download. Open the cloud function and click onto the trigger tab, then copy the URL to your clipboard.


In the cloud scheduler we just need to determine the frequency and the URL of the cloud function. I’ll post an empty json object as it’s required by the scheduler (even though I won’t consume it directly within the cloud function).
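Before relying on the schedule, it can help to hit the function the same way the scheduler will: an HTTP POST with an empty JSON object as the body. The sketch below only builds that request; the URL is a made-up placeholder, so substitute the one copied from the trigger tab.

```python
import urllib.request


def build_trigger_request(function_url: str) -> urllib.request.Request:
    """Build a POST request carrying an empty JSON object, like the scheduler job."""
    return urllib.request.Request(
        function_url,
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical URL; replace with your function's trigger URL
    request = build_trigger_request("https://example.invalid/downloadFirms")
    print(request.get_method(), request.full_url, request.data)
    # To actually send it: urllib.request.urlopen(request)
```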


Next I need a cloud function that subscribes to the topic and does something with the data. For quick demonstration purposes I’ll just write the data out to the console (which will materialize in stackdriver as a log event).
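For reference, a Pub/Sub-triggered python function can be as small as the sketch below. It assumes the background-function signature `(event, context)`, where the message body arrives base64-encoded in `event['data']`; the function name is my own.

```python
import base64
import json


def log_new_firm(event, context=None):
    """Decode the Pub/Sub message and write the firm to the log."""
    raw = base64.b64decode(event["data"]).decode("utf-8")
    firm = json.loads(raw)
    # print output from a cloud function ends up as a log entry
    print("New firm received: {}".format(firm.get("firm_name", "<unknown>")))
    return firm


if __name__ == "__main__":
    # Build an event shaped like the one Pub/Sub delivers
    body = json.dumps({"firm_name": "Example Firm"}).encode("utf-8")
    log_new_firm({"data": base64.b64encode(body).decode("utf-8")})
```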


With that created I can now test the download function, which should result in new messages in the topic, and then new output in Stackdriver. I can also create log metrics based on the content of the log. For instance, I can create a metric for number of new firms, number of errors, or average runtime execution duration (cloud functions cap out in terms of their lifetime, so this is important to consider).


Now I could just put the “create folder in CM” logic within my existing cloud function, but then I’m tightly-coupling the download of the firms to the registration of folders. That would limit the extent to which I can re-use code and cobble together new feature functionality. Tightly-coupled solutions are harder to maintain, support, and validate.
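A toy, in-memory stand-in for a topic shows why the loose coupling pays off: the publisher never changes when a new subscriber is added. This is not the Pub/Sub API, just an illustration of the fan-out idea, and every name in it is made up.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBroker:
    """Minimal topic/subscriber registry used only to illustrate fan-out."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber and return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


if __name__ == "__main__":
    broker = InMemoryBroker()
    broker.subscribe("new_firm", lambda firm: print("log:", firm))
    broker.subscribe("new_firm", lambda firm: print("create folder in CM for:", firm["firm_name"]))
    broker.publish("new_firm", {"firm_name": "Example Firm"})
```

Adding the "create folder in CM" step is a new `subscribe` call; the download logic stays exactly as it was.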

In the next post we’ll update the cloud function that pushes the firm into the Content Manager dataset!

Purging content after backing-up and then restoring a database

How do we create new datasets that contain just a portion of the content from an existing dataset? I’ve been asked this dozens of times over the years. Someone asked this past week and I figured I’d bundle up my typical method into a powershell script. This script can be run via a scheduled task, allowing you to routinely execute it for such purposes as refreshing development environments, exporting to less secured environments, or to curate training datasets.

Start by first installing the SQL Server powershell tools from an administrative powershell window by running this command:

Install-Module -Name SqlServer

You will be prompted to allow the install to continue.


It will then initiate the download and attempt to install.


In my case, I’ve already got it installed. If you do as well then you’ll see these types of errors, which are safe to ignore.


Now we can start scripting out the process. The first step is to start the backup of the existing database. You can use the Backup-SqlDatabase command. To make things easier we use variables for the key parameters (like instance name, database name, path to the backup file).

# Backup the source database to a new file
Write-Information "Backing Up $($sourceDatabaseName)"
Backup-SqlDatabase -ServerInstance $serverInstance -Database $sourceDatabaseName -BackupAction Database -CopyOnly -BackupFile $backupFilePath

Next comes the restore of that backup. Unfortunately I cannot use the Restore-SqlDatabase cmdlet, because it does not support restoring over databases that have other active connections. Instead I’ll have to run a series of statements that set the database to single-user mode, restore the database (with relocated files), and then set the database back to multi-user mode.

# Restore the database by first getting an exclusive lock with single user access, enable multi-user when done
Write-Warning "Restoring database with exclusive use access"
$ExclusiveLock = "USE [master]; ALTER DATABASE [$($targetDatabaseName)] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;"
$RestoreDatabase = "RESTORE FILELISTONLY FROM disk = '$($backupFilePath)'; RESTORE DATABASE $($targetDatabaseName) FROM disk = '$($backupFilePath)' WITH replace, MOVE '$($sourceDatabaseName)' TO '$($sqlDataPath)\$($targetDatabaseName).mdf', MOVE '$($sourceDatabaseName)_log' TO '$($sqlDataPath)\$($targetDatabaseName)_log.ldf', stats = 5; ALTER DATABASE $($targetDatabaseName) SET MULTI_USER;"
ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement ($ExclusiveLock+$RestoreDatabase)

Next we need to change the recovery mode from full to simple. Doing so will allow us to manage the growth of the log files. This is important because the delete commands we run will spool changes into the database logs, which we’ll need to shrink as often as possible (otherwise the DB server could potentially run out of space).

# Change recovery mode to simple
Write-Information "Setting recovery mode to simple"
Invoke-Sqlcmd -ServerInstance $serverInstance -Database $targetDatabaseName -Query "ALTER DATABASE $($targetDatabaseName) SET RECOVERY SIMPLE"

With the database in simple recovery mode we can now start purging content from the restored database. Before digging into the logic of the deletes, I’ll need to create a function I can call that traps errors. I’ll also want to be able to incrementally shrink, if necessary.

# Function below is used so that we trap errors and shrink at certain times
function ExecuteSqlStatement 
{
    param([String]$Instance, [String]$Database, [String]$SqlStatement, [bool]$shrink = $false)
    # Avoid the name $error, which is a PowerShell automatic variable
    $hadError = $false
    # Trap all errors
    try {
        # Execute the statement; -ErrorAction Stop makes SQL errors terminating so the catch block fires
        Write-Debug "Executing Statement on $($Instance) in DB $($Database): $($SqlStatement)"
        Invoke-Sqlcmd -ServerInstance $Instance -Database $Database -Query $SqlStatement -ErrorAction Stop | Out-Null
        Write-Debug "Statement executed with no exceptions"
    } catch [Exception] {
        Write-Error "Exception Executing Statement: $($_)"
        $hadError = $true
    } finally {
        # When no error and parameter passed, shrink the DB (can slow process)
        if ( $hadError -eq $false -and $shrink -eq $true ) {
            ShrinkDatabase -SQLInstanceName $serverInstance -DatabaseName $Database -FileGroupName $databaseFileGroup -FileName $databaseFileName -ShrinkSizeMB 48000 -ShrinkIncrementMB 20
        }
    }
}

To implement the incremental shrinking I use a method I found a few years ago, linked here. It’s great as it works around the super-slow shrinking process when done on very large datasets. Your database administrator should pay close attention to how it works and align it with your environment.

My goal is to remove all records of certain types, so that they aren’t exposed in the restored copy. Unfortunately the out-of-the-box constraints do not cascade and delete related objects. That means we need to delete them before trying to delete the records.

We need to delete:

  • Workflows (and supporting objects)

  • Records (and supporting objects)

  • Record Types

In very large datasets this process could take hours. To keep the database files from growing unchecked along the way, add a “-shrink $true” parameter to any of the delete statements that impact large volumes of data in your org (electronic revisions, renditions, or locations, for instance). Shrinking slows the run, so use it selectively.

# Purging content by record type uri
foreach ( $rtyUri in $recTypeUris ) 
{
    Write-Warning "Purging All Records & Supporting Objects for Record Type Uri $($rtyUri)"
    Write-Information " - Purging Workflow Document References"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tswkdocusa where wduDocumentUri in (select uri from tswkdocume where wdcRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri))"
    Write-Information " - Purging Workflow Documents"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tswkdocume where wdcRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri)"
    Write-Information " - Purging Workflow Activity Start Conditions"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tswkstartc where wscActivityUri in (select uri from tswkactivi where wacWorkflowUri in (select uri from tswkworkfl where wrkInitiator in (select uri from tsrecord where rcRecTypeUri=$rtyUri)))"
    Write-Information " - Purging Workflow Activities"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tswkactivi where wacWorkflowUri in (select uri from tswkworkfl where wrkInitiator in (select uri from tsrecord where rcRecTypeUri=$rtyUri))"
    Write-Information " - Purging Workflows"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tswkworkfl where wrkInitiator in (select uri from tsrecord where rcRecTypeUri=$rtyUri)"
    Write-Information " - Purging Communications Detail Words"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tstranswor where twdTransDetailUri in (select tstransdet.uri from tstransdet inner join tstransmit on tstransdet.tdTransUri = tstransmit.uri inner join tsrecord on tstransmit.trRecordUri = tsrecord.uri where tsrecord.rcRecTypeUri = $rtyUri);"
    Write-Information " - Purging Communications Details"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tstransdet where tdTransUri in (select tstransmit.uri from tstransmit inner join tsrecord on tstransmit.trRecordUri = tsrecord.uri where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Communications"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tstransmit where trRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Thesaurus Terms"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecterm where rtmRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Relationships"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsreclink where rkRecUri1 in (select uri from tsrecord where rcRecTypeUri=$rtyUri) OR rkRecUri2 in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Actions"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecactst where raRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Jurisdictions"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecjuris where rjRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Requests"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecreque where rqRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Rendition Queue"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrendqueu where rnqRecUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Renditions"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrenditio where rrRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Revisions"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tserecvsn where evRecElecUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Documents"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecelec where uri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Holds"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tscasereco where crRecordUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Record Locations"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecloc where rlRecUri in (select uri from tsrecord where rcRecTypeUri=$rtyUri);"
    Write-Information " - Purging Records (shrink after)"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrecord where rcRecTypeUri = $rtyUri" -shrink $true
    Write-Information " - Purging Record Types"
    ExecuteSqlStatement -Instance $serverInstance -Database $targetDatabaseName -SqlStatement "delete from tsrectype where uri = $rtyUri"
}

With that out of the way we can now restore the recovery mode back to full.

# Change recovery mode back to full
Write-Information "Setting recovery mode to full"
Invoke-Sqlcmd -ServerInstance $serverInstance -Database $targetDatabaseName -Query "ALTER DATABASE $($targetDatabaseName) SET RECOVERY FULL"

Last step is to restart the workgroup service on the server using this restored database.

# Restart CM
Write-Warning "Restarting Content Manager Service"
Restart-Service -Force -Name $cmServiceName

At the top of my script I have all of the variables defined. To make this work for you, you’ll need to adjust the variables to align with your environment. For instance, you’ll need to update the $recTypeUris array to contain the URIs of those record types you want to have deleted.

# Variables to be used throughout the script
$serverInstance = "localhost"                   # SQL Server instance name 
$sourceDatabaseName = "CMRamble_93"             # Name of database in SQL Server
$targetDatabaseName = "Restored_cmramble_93"    # Name to restore to in SQL Server
$backupFileName = "$($sourceDatabaseName).bak"  # File name for backup of database
$backupPath = "C:\temp"                         # Folder to back-up into (relative to server)
$backupFilePath = [System.IO.Path]::Combine($backupPath, $backupFileName)   # Use .Net's path join which honors OS
$databaseFileGroup = "PRIMARY"                  # File group of database content, used when shrinking
$databaseFileName = "CMRamble_93"               # Filename within file group, used when shrinking
$sqlDataPath = "D:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA"   # Path to data files, used in restore
$cmServiceName = "TRIMWorkgroup"                # Name of registered service in OS (not display name)
$recTypeUris = @( 3 )                           # Array of uri's for record types to be purged after restore

To find the URIs of your record types, you’ll need to customize your view pane and add in the unique identifier property.


Running the script gives me this output….


On my local machine it took ~1 minute to complete for a super small dataset. Running it against a 70 GB file and removing approximately 20 GB of content takes 15 minutes, though my SQL Server has 64 GB of RAM and two SSD drives that hold the SQL Server data & log files.

You can download my full script here: https://github.com/aberrantCode/cm_db_restore_and_purge