# Training Auto-Classification Models for RIM Bot

To use the RIM Bot to [auto-classify documents in the Document Inbox](/en/lr/51809/), a _Trained Model_ must be trained and deployed. This training allows the machine learning model to learn from your inputs, preparing it to intelligently process data.

Vault automatically creates a _Trained Model_ record of the _Document Classification_ type in all RIM Vaults with 1,500 or more Steady state documents. As long as a _Trained Model_ is not already deployed, Vault also deploys the model for you. In every Vault, custom and system-trained models refresh with each Vault release.

This process occurs once per release, so if you wish to update your _Trained Model_ at any time (for example, to reflect new document types, or to attempt to improve your results), you must follow the process described here to train, evaluate, and deploy it.

## How Auto-trained Models Work


<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: RIM Bot is automatically enabled in all RIM Vaults with 1,500 or more Steady state documents. While we highly recommend allowing Vault to automatically train the related document classification models as described here, you can opt out as required by inactivating the <em>Auto-Train Models</em> job.</p>
    </div>
  </div>
</div>


Vault uses the following process to create, train, and deploy a Document Classification _Trained Model_:

1. The _Auto-Train Models_ job runs each night at 1:00am EST on all production and pre-release[^1] Vaults. This job will check that:
    * No system-created model has been created since the last major release
    * There are at least 1,500 Steady state documents in this Vault
2. The job creates a _Trained Model_ of the _Document Classification_ type with the below default values. If your previously-deployed model included _Excluded Classification_ records or criteria in the _Document Criteria - VQL_ field, Vault copies them to the new trained model.
    * _Prediction Confidence Threshold_: 0.85
    * _Minimum Documents per Document Type_: 10
    * _Auto-Deploy_: Yes (true)
3. The latest document versions (based on the _Version Created Date_) that fall into the following categories are used to train this model:
    * In a Steady state (_Approved_/_Final_)
    * Not a Binder
    * Not in an [unsupported](/en/lr/51809/#auto-classification-limitations) document type
    * The document has pages
4. If there is already a deployed model of the same _Trained Model Type_ in this Vault, the auto-trained model stays in the training state. Otherwise, Vault deploys it automatically after it finishes training.
5. Once the auto-trained model is deployed, any documents uploaded to the Inbox may be auto-classified by the RIM Bot.
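
The checks in the steps above can be summarized as a short Python sketch. This is only an illustration of the documented rules, not Vault's implementation: the `Document` class, its field names, and all function names are hypothetical.

```python
from dataclasses import dataclass

MIN_STEADY_STATE_DOCS = 1500  # minimum stated in this article


@dataclass
class Document:
    """Hypothetical, simplified view of a latest document version."""
    in_steady_state: bool   # for example, Approved or Final
    is_binder: bool
    unsupported_type: bool  # document type is on the unsupported list
    page_count: int


def job_should_create_model(steady_state_count: int,
                            system_model_since_release: bool) -> bool:
    """Step 1: both checks must pass before the job creates a model."""
    return (not system_model_since_release
            and steady_state_count >= MIN_STEADY_STATE_DOCS)


def is_training_candidate(doc: Document) -> bool:
    """Step 3: criteria a document version must meet to be used in training."""
    return (doc.in_steady_state
            and not doc.is_binder
            and not doc.unsupported_type
            and doc.page_count > 0)


def auto_deploys_after_training(same_type_model_deployed: bool) -> bool:
    """Step 4: auto-deploy only if no model of the same type is deployed."""
    return not same_type_model_deployed
```

For example, `job_should_create_model(1499, False)` evaluates to `False` because the Vault is one document short of the minimum.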


<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: Users cannot edit the <em>Prediction Confidence Threshold</em> for system-trained models.</p>
    </div>
  </div>
</div>


Auto-classification models are automatically refreshed with each Vault general release. If you are currently using a manually-trained model, Vault makes a copy of that model, trains it, and deploys it to replace the old model. Otherwise, if you are currently using the previous release's auto-trained model, Vault deploys this new model to replace the old Trained Model. This ensures the system is training on the latest documents that represent your document hierarchy.

The RIM Bot has two options for sourcing data to train its model. For Production Vaults, it uses document content from the same Vault. In non-Production Vaults, administrators can choose to use content from either the Production Vault or the same non-Production Vault. In all cases, the training data and AI model are fully encapsulated within the customer's environment, just as documents and data are completely separated among customers.

Given the number of RIM customers, auto-training and deploying models in your Vault may take 48 to 72 hours after each general release.

[^1]: Pre-release Vaults will use the production Vault's documents to auto-train the model.

## How To Train a Model {#how-to-train-a-model}

Like all machine learning tools, the RIM Bot requires input to learn before performing tasks on its own. Generally, the larger and more accurate the inputs, the better the resulting model will be. Vault stores accumulated input in _Trained Model_ object records.

### Prediction Confidence

Vault uses a _Prediction Confidence_ score to indicate how certain RIM Bot is that its prediction is correct. This value is between 0 (likely wrong) and 1 (likely correct). The better your inputs, the higher the _Prediction Confidence_ will be. Vault stores _Prediction Confidence_ scores in [_Prediction_][2] object records.

### Prediction Confidence Threshold

Vault uses the _Prediction Confidence Threshold_ field value on a _Trained Model_ record to determine the score a _Prediction_ must reach before the model can use it. The _Prediction Confidence Threshold_ is system-managed and set to 0.85 by default. For example, if the _Prediction Confidence_ for a document uploaded to the Document Inbox is 0.8728, the score is above the threshold and Vault auto-classifies the document.
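
As a minimal sketch of this comparison, the following Python function treats a prediction as usable when its score is above the threshold. The function name is hypothetical, and the behavior for a score exactly equal to the threshold is an assumption, since this article only describes scores "above" it.

```python
DEFAULT_THRESHOLD = 0.85  # system-managed default described in this article


def is_auto_classified(prediction_confidence: float,
                       threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True when the score is high enough for auto-classification.

    Assumption: a score must be strictly above the threshold.
    """
    if not 0.0 <= prediction_confidence <= 1.0:
        raise ValueError("Prediction Confidence must be between 0 and 1")
    return prediction_confidence > threshold
```

With the example score from this section, `is_auto_classified(0.8728)` is `True`, while a score of 0.80 would leave the document unclassified.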

## Creating a Document Classification _Trained Model_

Before creating a _Trained Model_, carefully consider the following limitations:

* Vault allows Admins to train models in <a class="external-link " href="https://rn.veevavault.help/en/gr/pre-release-faq/" target="_blank" rel="noopener">Pre-release<i class="fa fa-external-link" aria-hidden="true"></i></a> or [Sandbox](/en/lr/48988/) environments using their production environment documents, verifying the training process. These models, however, cannot be moved to your production Vault, so _Trained Models_ must be created and trained in the production environment as well.
* Certain categories of documents cannot be auto-classified or used in model training. These include:
    * Audio and video files
    * Non-text files, such as ZIP files, statistical files, or database files
    * Non-English files. When enabled, Vault can train on and detect key metadata in documents in non-English, [Vault-supported languages](/en/lr/16678/). Contact <a class="external-link " href="https://support.veeva.com/hc/en-us" target="_blank" rel="noopener">Veeva Support<i class="fa fa-external-link" aria-hidden="true"></i></a> or your Veeva Services representative to enable the Multilingual Model feature.
    * Documents where Vault cannot extract text, for example, if the text is too blurry or if the file is password-protected or encrypted.
* We recommend using at least 3,000 documents in steady states, such as _Approved_ or _Final_, to train the machine learning model. You may use RIM Bot on Vaults with 1,000 to 3,000 documents; however, this may limit the quality of your predictions.
* If any inputs are misclassified documents, predictions may be negatively impacted. For example, if several documents that should have been classified as _Regulatory > Correspondence > Approval Letter_ were classified as _Regulatory > Correspondence > Agency Decisions_, RIM Bot will be less confident about predictions for those document types.
* The following fields are system managed and set with default values: _Prediction Confidence Threshold_: 0.85; _Minimum Documents per Type_: 10; _Auto-Deploy_: Yes (true)

### Creating the _Trained Model_ Object Record {#enabling-and-creating-the-trained-model-object-record}

1. Navigate to **Admin > Configuration > Document Fields** and review your Vault's configuration for the _RIM Auto Classification_ and _Tags_ fields. In order for users to [observe the auto-classification process](/en/lr/51809/#how-the-rim-bot-auto-classifies-documents) in their Document Inbox:
    * The _RIM Auto Classification_ field must be configured for each document type to be auto-classified. This includes the _Unclassified_ document type. 
    * Field-level security for the _Tags_ field must be configured as Read Only or Editable.
2. Navigate to **Admin > Business Admin** and click into the _Trained Model_ object.
3. Click **Create**.
4. For the **Trained Model Type**, select **Document Classification**.
5. The **Prediction Confidence Threshold** is system managed and is set to 0.85 by default. You do not need to enter any value in this field.
6. Set the [**Training Window Start Date**][5] accordingly.
7. Click **Save**.

After creating the _Trained Model_ object record, optionally add any [Excluded Classifications][4], then [train the model][5].

### Creating Excluded Classifications {#creating-excluded-classifications}

You can define classifications that will be excluded from your Trained Model. The RIM Bot excludes the specified classification(s) from all extraction, training, and testing during model deployment. Additionally, later predictions the RIM Bot makes are not actioned if a document is in (or predicted to be in) an excluded classification.

You can specify excluded classifications before or after a model is trained. If you add an excluded classification after the model's training, the model is not automatically retrained. However, the RIM Bot does not take any action against documents of the excluded classification.

This exclusion applies only to the Trained Model to which the Excluded Classification belongs. If you create an Excluded Classification for a Trained Model which is no longer in use, you must re-define it for the currently-deployed model.
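
The exclusion rule can be expressed as a small Python sketch. The function and parameter names are hypothetical, and the classification strings are only examples. The sketch covers the rule stated above: RIM Bot takes no action when a document is in, or is predicted to be in, an excluded classification.

```python
from typing import Optional, Set


def rim_bot_may_act(predicted: str,
                    current: Optional[str],
                    excluded: Set[str]) -> bool:
    """Return False if either the current or the predicted classification
    is excluded for the deployed Trained Model."""
    return predicted not in excluded and current not in excluded


# Example exclusion set for a single Trained Model (illustrative values).
excluded = {"Regulatory > Correspondence > Agency Decisions"}
```

Because exclusions belong to a single Trained Model, a set like `excluded` would need to be defined again for each newly deployed model.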

To create an excluded classification:

1. Under **Excluded Classifications**, click **Create**.
2. Select the **Status** of the Excluded Classification.
3. Select the **Classification** you wish to exclude.
4. Enter any relevant **Comments**.
5. Click **Save**.

### Training the _Trained Model_ {#training-the-trained-model}

Once you have [created][6] the _Trained Model_, perform the **Train Model** action and click **Start**. The _Trained Model_ record moves to the _In Training_ state.

To train your model, Vault sets the _Training Window Start Date_ and pulls all non-Archived documents in a Steady State, such as Approved or Final, with a _Version Created Date_ value between the _Training Window Start Date_ and the current date. If more than 200,000 documents fit these criteria, Vault uses the 200,000 most recent documents.
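
The selection rule above can be sketched in Python as a filter followed by a cap. This is an illustration under assumptions: the `DocVersion` class and its fields are hypothetical, and "most recent" is interpreted here as the latest _Version Created Date_.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

MAX_TRAINING_DOCS = 200_000  # cap stated in this article


@dataclass
class DocVersion:
    """Hypothetical, simplified document version record."""
    doc_id: int
    version_created_date: date
    steady_state: bool
    archived: bool


def select_training_set(docs: List[DocVersion],
                        window_start: date,
                        today: date,
                        limit: int = MAX_TRAINING_DOCS) -> List[DocVersion]:
    """Keep non-archived, steady-state versions inside the training window,
    then keep at most `limit` of the most recent ones."""
    eligible = [
        d for d in docs
        if d.steady_state
        and not d.archived
        and window_start <= d.version_created_date <= today
    ]
    # Newest first, so slicing keeps the most recent documents.
    eligible.sort(key=lambda d: d.version_created_date, reverse=True)
    return eligible[:limit]
```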

Additionally, an asynchronous job tracks two activities as part of training:

1. **Document Extraction**: During this process, the system collects the data from the document set. The output is a CSV file (`document_extract_results.csv`), attached under _Trained Model Artifacts_, in which an Admin can see which documents could be used as input and which could not. Vault sends a notification to the Admin who started the action when the extraction is complete.
2. **Model Training**: During this process, the system will use 80% of the extracted data to build a machine learning neural network model, then test that model using the remaining 20%. The output is a number of [performance metrics](/en/lr/518092/) in both the _Trained Model Performance Metrics_ object and attached CSVs under _Trained Model Artifacts_. Vault sends a notification to the Admin who started the action when training is complete.
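
To illustrate the 80/20 division described in the **Model Training** step, the following Python sketch splits a list of extracted items into a training portion and a test portion. How Vault actually selects the two portions is not documented here, so the random shuffle, the `seed` parameter, and the function name are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(items: Sequence[T],
                     train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the items, then split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable results
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 3,000 extracted documents, this split would yield 2,400 documents for building the model and 600 for testing it.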

The time required to complete these jobs varies depending on the number of documents used as input, from about 1 hour for Vaults training on 3,000 documents to about 24 hours for Vaults training on 200,000 documents.

Once model training is complete, the _Trained Model_ record moves to the _Trained_ state.

### Training a _Trained Model_ in Pre-Release or Sandbox Environments with Production Data

You can train a _Trained Model_ in your Pre-Release or Sandbox Vault with production documents for evaluation purposes. You cannot move the resulting _Trained Model_ to your production environment.

To train using production data, run the **Train Model From Production Data** action. This action is only visible in Pre-Release and Sandbox Vaults.

After evaluating your _Trained Model_, you'll need to perform training again in your production Vault to begin using RIM Bot features there.

### Evaluating the _Trained Model_

Vault provides key metrics you can reference in the _Trained Model_ record's _Training Summary Results_ field to evaluate your model: Extraction Coverage, Auto-classification Coverage, and Auto-classification Error Rate. See the [definitions for these metrics and how to improve them](/en/lr/518092/#auto-classification-evaluation-key-metrics).

### Deploying the _Trained Model_

Once you have trained and evaluated your _Trained Model_, select the **Deploy Model** action from the _Trained Model_ record, review the prompt to ensure you agree with the outcome, and click **Start**. The _Trained Model_ record moves to the _In Deployment_ state.

An asynchronous job tracks the deployment of this _Trained Model_ in your Vault. The time required to complete this job varies and can range from 30 minutes to two hours. Vault sends a notification to the Admin who performed the action when deployment is complete.

Once the deployment job finishes, the _Trained Model_ record moves to the _Deployed_ state and Vault begins auto-classifying the documents in the Document Inbox.

Only one _Trained Model_ per _Trained Model Type_ can be deployed at a time.


<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: Vault automatically refreshes deployed <em>Trained Models</em> every general release.</p>
    </div>
  </div>
</div>


#### Replacing a Deployed _Trained Model_

To replace a deployed model with a new _Trained Model_, simply deploy the new model. It replaces the currently active model, and auto-classification is not interrupted. This is the recommended method for replacing models.

#### Refreshing a Deployed _Trained Model_

To refresh a deployed model, select the **Refresh Model** action. It will automatically create a deep copy of the current _Trained Model_ and start the training process. This action prevents users from starting multiple training jobs simultaneously, and refreshes a _Trained Model_ in fewer steps.

If the multilingual model has been enabled in your Vault for the first time, [a new model must be trained](#how-to-train-a-model) rather than retraining the existing model or any model that was in use prior to the multilingual model feature being enabled.

### Additional _Trained Model_ Actions & Details

You can only have five _Trained Models_ per _Trained Model Type_. If you attempt to train a sixth, Vault advises you to archive a model before training another. To do so, select the **Archive Model** action on a _Trained Model_ record. The _Trained Model_ record moves to the _Archived_ state. Archived models are not recoverable.

You can also remove deployed models and disable auto-classification by using the **Withdraw Model** action on a _Trained Model_ in the _Deployed_ state. Doing so moves the _Trained Model_ record back to the _Trained_ state.

## About the _Prediction_ Object {#about-the-prediction-object}

When a _Trained Model_ is deployed and used to predict data for a document, the _Prediction_ object keeps track of each individual prediction attempt. It's unlikely that Admins will need to work with this object directly, but it may be useful to understand the object fields:

* **Prediction ID**: Unique identifier for that prediction, automatically assigned by Vault
* **Related Record Unique ID**: Identifier for the file being evaluated, automatically assigned by Vault
* **Related Record**: Metadata for the document being evaluated, formatted as JSON. You can locate the Vault Document ID, Major version, and Minor version here if needed.
* **Predictions**: The prediction data for this attempt from RIM Bot, formatted as JSON. You can use this field to understand if a prediction failed and why; which _Trained Model_ was used to make the prediction; and, in the case of Document Classification, the first, second, and third top predictions from the model along with their _Prediction Confidence_ scores. If the first Prediction score is above the deployed _Trained Model_ _Prediction Confidence Threshold_, the document will have been auto-populated with that prediction. This can also be seen with the auto-populated JSON parameter.
* **Feedback**: Post-prediction activity. The `trueValue` JSON parameter shows the current value for the data being predicted, and the `trueValueMatch` JSON parameter shows whether that value matches the corresponding first Prediction in the Predictions field.
* **Additional Details**: Lists from where Vault generates the prediction. This can include multiple sources.

## About the _Prediction Metrics_ Object

When a _Trained Model_ is deployed and used to predict data for a document, the _Prediction Metrics_ object keeps track of the model's performance over time. The _Prediction Metrics_ job runs monthly and generates records that track the overall _Trained Model_ performance, as well as performance per document classification.

You can view the following object fields from the _Trained Model_ page layout:

* **Model Performance ID**: Unique ID, assigned by Vault
* **Created Date**: Date the prediction metric was calculated
* **Trained Model Type**: The Trained Model Type being evaluated, for example Auto-Classification
* **Metric Type**: Metric type presented
* **Metric Subtype**: Subtype of the metric presented
* **Number of Documents**: The number of documents sent to the RIM Bot during the given time period.
* **Documents Extracted**: The number of documents sent to the RIM Bot that had text successfully extracted and evaluated by the model.
* **Extraction Rate**: The rate at which documents sent to the RIM Bot had their text successfully extracted (**Documents Extracted** divided by **Number of Documents**).
* **Documents with Predictions**: The number of documents with a predicted value. For Auto-Classification, these are all documents sent to the RIM Bot.
* **Correct Predictions**: The number of times the predicted value was accurate, whether or not the RIM Bot acted upon it. For Auto-Classification, the predicted classification is correct whether or not it is above the _Prediction Confidence Threshold_.
* **Predictions Above Threshold**: The number of times the RIM Bot acted upon the prediction. For Auto-Classification, this means the predicted classification was above the model's _Prediction Confidence Threshold_.
* **Correct Predictions Above Threshold**: The number of times the predicted value was accurate and the RIM Bot acted upon it. For Auto-Classification, this means RIM Bot set the correct classification on the document.
* **Success Rate**: The rate at which predictions on which the system acted were confirmed as true predictions (**Correct Predictions Above Threshold** divided by **Predictions Above Threshold**)
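
The two rates defined in this list are simple ratios. The Python sketch below shows the arithmetic. The function names and the sample counts are hypothetical, and returning `None` for an empty period is an assumption rather than documented Vault behavior.

```python
from typing import Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Divide two counts, returning None when there is nothing to measure."""
    return None if denominator == 0 else numerator / denominator


def extraction_rate(documents_extracted: int,
                    number_of_documents: int) -> Optional[float]:
    """Documents Extracted divided by Number of Documents."""
    return _ratio(documents_extracted, number_of_documents)


def success_rate(correct_above_threshold: int,
                 predictions_above_threshold: int) -> Optional[float]:
    """Correct Predictions Above Threshold divided by
    Predictions Above Threshold."""
    return _ratio(correct_above_threshold, predictions_above_threshold)
```

For instance, if 100 documents were sent to the RIM Bot and 90 had text extracted, the Extraction Rate is 0.9. If 50 predictions were above the threshold and 45 of them were correct, the Success Rate is also 0.9.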

[1]: #training-window-start-date
[2]: #about-the-prediction-object
[3]: #choosing-a-document-set-method
[4]: #creating-excluded-classifications
[5]: #training-the-trained-model
[6]: #enabling-and-creating-the-trained-model-object-record