
Introduction

The GLYNT API is a RESTful API that makes it easy to upload documents and extract clean, labelled data. If you log in at https://api.glynt.ai with your user credentials (provided by your GLYNT customer representative), you will be able to browse the API interactively to view and edit your data. The base URL of the API is https://api.glynt.ai/v6/. This is referred to as the api_base_url.

All data uploaded to or created by the API is segmented by Data Pool. Your organization may have one or more Data Pools. A production Data Pool and a stage Data Pool will be created for your organization by your GLYNT representative, but additional Data Pools can be created on request. These Data Pools are completely separate environments. The ID of the Data Pool to be interacted with is passed in the URL of every request to the API like so: <api_base_url>/data-pools/<data_pool_id>/. This is referred to as the datapool_url, and serves as the base URL for all endpoints outside of authorization.

To get started, you must first provide training data to your GLYNT customer representative (outside of the API) so that the machine learning models can be prepared for your unique document types. Your customer representative can tell you what they need, give you an estimate of how long training will take, and will let you know when it is available for extractions. You will also be able to see the prepared Training Sets using the /training-sets/ endpoints.

Once your Training Sets have been created and made available on your Data Pool(s), you're ready to begin interacting with the GLYNT API, using standard REST endpoints to interact with various resources. To begin a session, authenticate with the API to obtain an access token as per the Auth section of the docs, below.

With your token in hand, you're ready to begin submitting data for extractions. The most common workflow will be:

  1. Upload Documents with one or several POSTs to the /documents/ endpoint, and subsequent PUTs to the temporary file_upload_urls.
  2. Initiate an Extraction Batch against the uploaded Documents with a POST to the /extraction-batches/ endpoint, passing the IDs of the recently uploaded Documents and the ID of the Training Set to extract against. This POST initiates the Extraction Batch job.
  3. Poll the /extraction-batches/ endpoint with GET requests using the Extraction Batch ID returned in step 2 until all results are available. Polling about once per minute is a reasonable default.
  4. Download the results for each Extraction of the finished Extraction Batch using the /extractions/ endpoint.
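Step 3, the polling loop, can be sketched in Python. This is a minimal sketch, not an official client: get_batch is any zero-argument callable that GETs the Extraction Batch and returns its JSON as a dict, and the status field name and "COMPLETE" value are assumptions (the actual batch schema is described in the detailed endpoint documentation).

```python
import time

def wait_for_batch(get_batch, poll_interval=60, timeout=3600):
    """Poll an Extraction Batch until it reports completion.

    get_batch: zero-argument callable returning the batch JSON as a dict
    (injected so the loop is easy to test without the network).
    The "status" field and "COMPLETE" value are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        batch = get_batch()
        if batch.get("status") == "COMPLETE":
            return batch
        time.sleep(poll_interval)
    raise TimeoutError("Extraction Batch did not complete before the timeout")
```

A one-minute poll_interval matches the default suggested above.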

You can use your user credentials directly to interact with your data and experiment with the system. Machine-to-Machine integrations are also supported. See the Machine-to-Machine Flow section below for more information.

Maintenance

Occasionally, the GLYNT API will go down for scheduled maintenance. During such maintenance, all API endpoints will return a 503 response.

Auth

There are two methods of authenticating and authorizing with the API, one for users and one for machine-to-machine integrations. In both the user and M2M flows, the result will be an access token which is issued to the requesting party.

Access tokens are valid for 12 hours. Refresh tokens are not supported at this time.

The access token is passed with all further requests to the API using the Authorization header, like so:

Authorization: <token_type> <access_token>
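In code, the header is just the two values joined with a space. A small Python helper, for example (the helper name is ours, not part of the API):

```python
def auth_header(token_type: str, access_token: str) -> dict:
    """Build the Authorization header from a get-token response."""
    return {"Authorization": f"{token_type} {access_token}"}

# Pass the result as headers to any HTTP client, e.g.:
# requests.get(url, headers=auth_header("Bearer", token))
```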

User Flow

To retrieve an access token using the User Flow:

curl --request POST \
     --url 'https://api.glynt.ai/v6/auth/get-token/' \
     --header 'content-type: application/json' \
     --data '{"username":"<YOUR_USERNAME>","password":"<YOUR_PASSWORD>"}'

Make sure to replace <YOUR_USERNAME> and <YOUR_PASSWORD> with the values provided to you by your GLYNT representative. This command will return JSON structured like this:

{
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6IlFqWXdSRFUzUkRrNFFVWXhPRVE0UTBZNE5VWXhSalV3T0RaRVJUTXhNME5DT1RBeFJqTTROQSJ9.eyJpc3MiOiJodHRwczovL2dseW50LWRldi5hdXRoMC5jb20vIiwic3ViIjoiYXV0aDB8NWMzNjE3N2IyODEwMjc1YjkzNDM0NjllIiwiYXVkIjoiZ2x5bnQtcHVibGljLWFwaS1kZXYiLCJpYXQiOjE1NDc3NDUzOTAsImV4cCI6MTU0NzgzMTc5MCwiYXpwIjoiM3dKTlNrUWc3ZFcweTFvVFg0WFRKVmxLd0NCc1ZablYiLCJzY29wZSI6IndyaXRlIHJlYWQiLCJndHkiOiJwYXNzd29yZCJ9.kUTnyQ_sxWMdRzCLnGLGs5XfiCh7IEWECI0BF2LhiAMt4GETr1-4FaqTm0ErnNpl7ZbKcLrf5wxWMCFMlkZDAGkERULRP6EtqVQjigU9P8QyXU8nSV9s05AB3K6LDAB1rFH5hjXJY8uNADbAR8ftx7QXBf0nBiy8Hsmeh9J7KhqhgIBAIFDema6OR02I4I9ovWsn2TcoHdfuKgtOFKkn8RGPR-6HgPAau8kl9NQTQDQsqsbqsPmh4f-8iZzNB5peAkHNggsoYoJREICAPWACkaMDCK7mLc8ELfbCeTJpN4w_7Bkff9iUs0xnH4gGF0KpUNRfu2aDr_QVn-oHNuGXsg",
  "token_type": "Bearer"
}

Users of your organization may use their credentials to request an API access token with a POST to the /get-token/ endpoint (see the detailed specification for more details). To have accounts created for your users, please contact your GLYNT representative. These are the same credentials used to access the browsable API at https://api.glynt.ai.

This flow is intended to give developers easy access to the API using simple credentials, in order to become familiar with the API or to execute ad hoc requests outside the scope of a more complete integration (which should use the Machine-to-Machine Flow).

User passwords must be 12+ characters, and there are no special character requirements.

Machine-to-Machine Flow

To retrieve an access token using the M2M Flow:

curl --request POST \
     --url 'https://glynt.auth0.com/oauth/token' \
     --header 'content-type: application/json' \
     --data '{"grant_type":"client_credentials","client_id":"<YOUR_CLIENT_ID>","client_secret": "<YOUR_CLIENT_SECRET>","audience":"glynt-public-api"}'

Make sure to replace <YOUR_CLIENT_ID> and <YOUR_CLIENT_SECRET> with the values provided to you by your GLYNT representative. This command will return JSON structured like this:

{
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6IlFqWXdSRFUzUkRrNFFVWXhPRVE0UTBZNE5VWXhSalV3T0RaRVJUTXhNME5DT1RBeFJqTTROQSJ9.eyJpc3MiOiJodHRwczovL2dseW50LWRldi5hdXRoMC5jb20vIiwic3ViIjoicDJlNXc4M1V6WkpZd3Bka3c1NkhBU0RRS3JsN3VpSFhAY2xpZW50cyIsImF1ZCI6ImdseW50LXB1YmxpYy1hcGktZGV2IiwiaWF0IjoxNTQ3NzQ1NzU5LCJleHAiOjE1NDc4MzIxNTksImF6cCI6InAyZTV3ODNVelpKWXdwZGt3NTZIQVNEUUtybDd1aUhYIiwic2NvcGUiOiJ3cml0ZSByZWFkIiwiZ3R5IjoiY2xpZW50LWNyZWRlbnRpYWxzIn0.n6aGI5G07bv_0Ur_XfN3M7Hh_NMpDU4TDj90aiKNsdKq7Jx_IyAud77vmdYLYlZ9-GJkcY-Qivl2GT0CW7uaLdIuCv3ZRTrR2fTqSomsFJh5Frsuu2w0DBbC6NbuKC1fIDFpqoCHJC5pmnvS9f3kdlaQJRbbTLhEJSDQRo6wh02bhtG63f8h8KUKJiJ4J7GeOfq0tQ-d3vf7dvcIqLHPJ0eaYNmTliI_Tw-ah6voql_3m-wpCqTA7wJGjNNw8ogs1-Lhke2X2Z_PoIh__bmq8PKGNmnVMTTpHRibsiiXl9KLpzwDBQOsUUN2EUrXURs1FDVx9iAaQBgNHTQD2i0qqQ",
  "token_type": "Bearer"
}

The primary method for communicating with the GLYNT API is through Machine-to-Machine integrations. M2M integrations allow an application written by your organization to use a secure set of credentials to interact with the GLYNT API outside of the context of a user. This lets you automate interactions with the API, for example to create tooling or build your own user interface.

To get started, contact your GLYNT representative. They will provide you with a Client ID and Client Secret. Keep these credentials safe and secure. If you ever suspect they have been compromised, contact your GLYNT Representative to have access revoked immediately.

Using the Client ID and Client Secret, your applications can execute an OAuth 2.0 Client Credentials Flow to obtain an access token for the GLYNT API.

Rate Limit

The API has a general rate limit of 200 requests per minute.

Labels, Tags & GLYNT Tags

Several resources have a label property. This property is an arbitrary string of at most 255 characters for Documents and 63 for all other resources. It is a meaningful label for the resource. Whether uniqueness is enforced depends on the resource type; see the relevant resource section for more information.

Several resources have tags and/or glynt_tags properties. These are both lists of strings, and each string is limited to 100 characters.

tags are for your use only, allowing you to attach more verbose information to a resource to facilitate managing your data. GLYNT will never modify this data, and you may always make changes to them as long as the resource exists. Each tags property may contain up to 10 tags. All tags are case insensitive.

You have read-only access to glynt_tags. This data is assigned by the GLYNT system or administrators. It is most often used for internal tagging, feature previews, or to facilitate communication about objects in the API. Unless noted otherwise, this data is volatile and can change without notice.

API List View Pagination

This query will show the 49th and 50th Documents.

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?limit=2&offset=48"

This command will return JSON structured like this (many properties have been excluded from each Document to simplify the example):

{
  "count": 50,
  "next": null,
  "previous": "http://api.glynt.ai/v6/data-pools/pRvt5/documents/?limit=2&offset=46",
  "results": [
    {
      "url": "http://api.glynt.ai/v6/data-pools/pRvt5/documents/a841b7ba/",
      "id": "a841b7ba",
      "label": "one_cool_doc.pdf"
    },
    {
      "url": "http://api.glynt.ai/v6/data-pools/pRvt5/documents/aef4b54e/",
      "id": "aef4b54e",
      "label": "a_lame_doc.jpg"
    }
  ]
}

Endpoints which return multiple resource instances are paginated. In addition to the results property which contains the resource instances themselves, such paginated endpoints also return count, next, and previous properties. These communicate the total number of resource instances in the list, a link to the next page in the list, and a link to the previous page in the list respectively. If there is no next or previous page, that property will be null.

The API uses a limit-offset pagination scheme. The limit is the number of items to retrieve; it defaults to 10 if not provided, and the maximum allowed value is 100. The offset indicates how many items in the list to skip, and defaults to 0. Thus, if limit and offset are both omitted, the view shows the 10 oldest instances of the resource.
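One way to consume a paginated list endpoint is to follow next until it is null, advancing offset by limit each time. Below is a sketch with the HTTP call injected as a fetch_page callable (a hypothetical wrapper around your HTTP client), so the paging logic stays client-agnostic:

```python
def iter_all(fetch_page, limit=100):
    """Yield every item from a limit-offset paginated list endpoint.

    fetch_page(limit, offset) must return the endpoint's JSON as a dict
    with "count", "next", and "results" keys (injected for testability).
    """
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        yield from page["results"]
        if page["next"] is None:
            return
        offset += limit
```

Following the returned next link instead of recomputing the offset works equally well; this version only shows the arithmetic.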

Ordering List Views

Take as an example this request:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/"

That will return all documents in the data pool, sorted by creation date, oldest first. That is equivalent to this:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?ordering=created_at"

We could instead invert the ordering, and have newest first:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?ordering=-created_at"

Or we could sort alphabetically by label:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?ordering=label"

Or we could combine two ordering values to first order by update time, and when two documents have matching updated_at properties, further sort by created_at, newest first:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?ordering=updated_at,-created_at"

By default, items in all list views are ordered by creation date, starting with the oldest.

The ordering query parameter may be passed when requesting any list view, and will be used to control how to order the items in the list. The below table summarizes options which are available on all list views, so long as the property itself exists on the resource type being listed. Some resources have special ordering values. The detailed documentation of those resources will explain their use.

Ordering Value Effect
created_at By creation date, oldest first. This is the default ordering used when no ordering value is passed.
updated_at By last updated date, oldest first.
label Alphabetical by label.

Every ordering filter value may have a - prepended to it, and this will reverse the usual ordering.

Multiple ordering values may be passed comma separated. In this case, objects will be ordered by the first ordering value, then by the second, and so on.
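If you build list URLs programmatically, the ordering value is just the comma-joined field names. A sketch using only the Python standard library (the helper name is ours, not part of the API):

```python
from urllib.parse import urlencode

def ordering_query(*fields: str) -> str:
    """Build an ordering query string, e.g. "ordering=updated_at,-created_at"."""
    # safe="," keeps the separator readable, matching the examples above
    return urlencode({"ordering": ",".join(fields)}, safe=",")
```

For example, f"{datapool_url}/documents/?{ordering_query('updated_at', '-created_at')}" reproduces the combined-ordering request above.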

Filtering

This is an example of filtering Documents for those which are tagged both invoice and customer 7:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?tag=invoice&tag=customer+7"

This command will return JSON structured like this (many properties have been excluded from each Document to simplify the example):

{
  "count": 15,
  "next": "http://api.glynt.ai/v6/data-pools/pRvt5/documents/?limit=10&offset=10",
  "previous": null,
  "results": [
    {
      "url": "http://api.glynt.ai/v6/data-pools/pRvt5/documents/a841b7ba/",
      "created_at": "2020-02-06T21:06:33.110518Z",
      "updated_at": "2020-02-06T21:10:02.149824Z",
      "id": "a841b7ba",
      "label": "one_cool_doc.pdf",
      "tags": ["invoice", "customer 7", "group 3"]
    },
    {
      "url": "http://api.glynt.ai/v6/data-pools/pRvt5/documents/aef4b54e/",
      "created_at": "2020-02-08T01:00:58.012030Z",
      "updated_at": "2020-02-08T01:03:22.288199Z",
      "id": "aef4b54e",
      "label": "a_lame_doc.jpg",
      "tags": ["invoice", "customer 7"]
    }
  ]
}

Given a Training Set of ID 'ts12345', query Documents related to that Training Set (output not shown):

curl "https://api.glynt.ai/v6/data-pools/pRvt5/documents/?training_set=ts12345"

Query Training Sets which are related to Document of ID 'do12345' (output not shown):

curl "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/?document=do12345"

Query Training Sets created before February 8th:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/?created_before=2020-02-08T00:00"

Query Training Sets created after the Training Set of ID a841b7ba was created. Note that since datetime filters are inclusive, we increase the microseconds by one to exclude it:

curl "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/?created_after=2020-02-06T21:06:33.110519Z"

Endpoints which return multiple resource instances may be filtered to return only a subset of the complete list. Multiple filters may be passed, and they will all be applied to find only resources which fulfill the requirements of all the filters.

Filter values must be URL encoded (see https://www.w3schools.com/tags/ref_urlencode.ASP). All filters are case sensitive.
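In Python, repeated filters such as tag can be passed to urlencode as a list of pairs, which also takes care of the URL encoding. This reproduces the query string from the example above:

```python
from urllib.parse import urlencode

# Two tag filters; "customer 7" contains a space, which urlencode escapes.
query = urlencode([("tag", "invoice"), ("tag", "customer 7")])
print(query)  # tag=invoice&tag=customer+7
```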

When passing a datetime value as a filter value, it should be ISO 8601 formatted, with a minimum specificity of YYYY-MM-DDThh:mm. If further granularity is desired, seconds, fractional seconds (up to 6 decimal places), and timezone offsets are all supported as per the ISO 8601 specification. An example of a completely specific datetime would be: 2020-02-06T11:45:13.000000-08:00, meaning February 6th 2020 at 11:45:13 Pacific Standard Time.
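Python's datetime.isoformat produces values in this format. For example, the fully specified datetime above:

```python
from datetime import datetime, timedelta, timezone

# February 6th 2020 at 11:45:13 Pacific Standard Time (UTC-8)
pst = timezone(timedelta(hours=-8))
dt = datetime(2020, 2, 6, 11, 45, 13, tzinfo=pst)

# timespec="microseconds" forces the fractional-seconds component
print(dt.isoformat(timespec="microseconds"))  # 2020-02-06T11:45:13.000000-08:00
```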

When a resource has a label, created_at, updated_at, tags, or glynt_tags property, the following filters are available:

Query Parameter Filters For
label Resources whose label exactly matches the given label.
label__contains Resources whose label contains the given string.
created_before Resources which have a created_at at or before the given ISO 8601 datetime.
created_after Resources which have a created_at at or after the given ISO 8601 datetime.
updated_before Resources which have an updated_at at or before the given ISO 8601 datetime.
updated_after Resources which have an updated_at at or after the given ISO 8601 datetime.
tag Resources with a Tag exactly matching the given Tag label.
tag__contains Resources with at least one Tag which contains the given string.
exclude_tag Resources that do not have a Tag exactly matching the given string.
glynt_tag Resources with a GLYNT Tag exactly matching the given GLYNT Tag label.
glynt_tag__contains Resources with at least one GLYNT Tag which contains the given string.
exclude_glynt_tag Resources that do not have a GLYNT Tag exactly matching the given string.

Whenever a resource has a relationship to other resource(s), that relationship can be queried using a filter with the name <related_resource_name>=<id_to_filter_on>. Such filters always use underscores between words in the filter name, and always use the singular form of the resource, never the plural (for example, ?training_set=<some_id>). These filters only search for exactly matching IDs, not partial matches. As with all filters, multiple filters may be passed to further restrict the query.

Some resources also provide specialized filters. See the relevant Resource section for more details.

Errors

When errors occur, they are returned with an HTTP status code, and a JSON body. For most errors, a detail property is included in the JSON body providing more information. For example, if you requested a resource which does not exist, you would receive a 404 status code with a JSON body like the following:

{
  "detail": "The requested resource could not be found."
}

For most 400 Bad Request errors, the returned JSON body includes properties naming the input parameters that caused the error, and/or a non_field_errors property listing errors not tied to a specific input parameter. Each property contains a list of error messages. For example, if you attempt to create a resource with missing required input, you will receive a 400 status code with a JSON body like the following:

{
  "label": ["Label is required"]
}

The GLYNT API uses the following error codes*:

Error Code Meaning
400 Bad Request -- Your request is invalid.
401 Unauthorized -- Your access token was not provided or is invalid.
403 Forbidden -- You do not have access to the requested resource.
404 Not Found -- You requested a resource that does not exist.
405 Method Not Allowed -- You tried to access a resource with an invalid method.
415 Unsupported Media Type -- Your request payload is in a format not supported by this resource.
429 Too Many Requests -- You've exceeded the rate limit.
500 Internal Server Error -- We had a problem with our server. Try again later. If it persists, contact your GLYNT representative.
503 Service Unavailable -- We're temporarily offline for maintenance. Please try again later.

* This error documentation applies to all endpoints hosted at api.glynt.ai. URLs outside of this domain have their own error handling procedures. Contact your GLYNT representative if you have any issues interacting with these third-party URLs.

Data Pools

Data Pools Overview

A Data Pool can be thought of as an "environment." Each Data Pool is a completely separate "silo" of documents, training sets, extractions, etc. Most often, you will need only two Data Pools: Sandbox (for testing integrations) and Production.

To manage your Data Pools, contact your GLYNT representative.

Retrieve all Data Pools

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/"

This command will return a 200 response with a JSON body structured like this:

{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "created_at": "2019-01-16T20:24:21.467694Z",
      "updated_at": "2019-01-16T20:24:21.467694Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/",
      "id": "pRvt5",
      "label": "Sandbox",
      "documents": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/",
      "extraction_batches": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/",
      "extractions": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/",
      "training_sets": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/",
      "glynt_tags": []
    },
    {
      "created_at": "2019-01-16T21:21:30.467694Z",
      "updated_at": "2019-01-16T21:24:31.645855Z",
      "url": "https://api.glynt.ai/v6/data-pools/dD314/",
      "id": "dD314",
      "label": "Production",
      "documents": "https://api.glynt.ai/v6/data-pools/dD314/documents/",
      "extraction_batches": "https://api.glynt.ai/v6/data-pools/dD314/extraction-batches/",
      "extractions": "https://api.glynt.ai/v6/data-pools/dD314/extractions/",
      "training_sets": "https://api.glynt.ai/v6/data-pools/dD314/training-sets/",
      "glynt_tags": []
    }
  ]
}

Lists all Data Pools. Notice that each Data Pool includes a collection of properties which link to the list views of the resources associated with that Data Pool.

HTTP Request

GET <api_base_url>/data-pools/

Retrieve a Data Pool

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/"

If the Data Pool exists, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2019-01-16T20:24:21.467694Z",
  "updated_at": "2019-01-16T20:24:21.467694Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/",
  "id": "pRvt5",
  "label": "Sandbox",
  "documents": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/",
  "extraction_batches": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/",
  "extractions": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/",
  "training_sets": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/",
  "glynt_tags": []
}

Returns detailed information about a given Data Pool.

HTTP Request

GET <api_base_url>/data-pools/<data_pool_id>/

Documents

Documents Overview

A Document is an image, PDF, scan, etc., and its associated metadata. The actual content of the file is referred to as the Document content; all other fields (such as label, tags, etc.) are the Document's metadata. Document content must meet certain imaging quality metrics. Once created, most properties of a Document resource are immutable. See the Change Document Properties section for more details.

Documents can be single or multiple pages. Artifacts and results refer to page numbers sequentially, starting from 1.

Uploading a Document's content can be achieved in one of two ways. It can be included directly in the initial POST request, or it can be uploaded in a separate PUT request after the Document instance has been created. See the Create a Document section for more details.

Once uploaded, the Document content can be accessed through temporary URLs. These URLs cannot be altered and must be used exactly as provided. Each Document resource which has an associated file has a permanent file_access_url. Retrieving this URL generates and returns a file_temp_url, which may be used to directly retrieve the file content for 1 hour. See the Retrieve a Document section for more details.

Retrieve all Documents

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/"

This command will return a 200 response with a JSON body structured like this:

{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "created_at": "2019-01-16T20:24:21.467694Z",
      "updated_at": "2019-01-16T20:24:21.467694Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/c71d90d2/",
      "file_access_url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/c71d90d2/file/",
      "id": "c71d90d2",
      "label": "one_cool_doc.pdf",
      "tags": [],
      "glynt_tags": [],
      "content_type": "application/pdf",
      "content_md5":"4DujaMxdUy64mWOWbP6Xew=="
    },
    {
      "created_at": "2019-01-16T21:21:30.467694Z",
      "updated_at": "2019-01-16T21:24:31.645855Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/",
      "file_access_url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/file/",
      "id": "442a2904",
      "label": "a_lame_doc.tiff",
      "tags": [],
      "glynt_tags": [],
      "content_type": "image/tiff",
      "content_md5":"gHtrtAskfdFDS2d11skAew=="
    }
  ]
}

Lists all Documents in the Data Pool.

HTTP Request

GET <datapool_url>/documents/

Create a Document

To create a Document using the recommended single-call approach, POST to the GLYNT API to create the Document instance and upload the content in one step. Include the Base64-encoded document content in the content key:

curl --request POST \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"label":"sample_doc_name","tags":["sample_tag"],"content_type":"application/pdf","content":"6jughaFskanHuja55lakAsi---snipped for brevity---mskni8nv292wnv232v33df2k323f2=="}'

On success, this command will return a 201 response with a JSON body structured like this. Notice that the content_md5 is calculated and returned if it was not provided on upload. The Document is now created and uploaded.

{
  "created_at": "2018-02-16T21:21:30.467694Z",
  "updated_at": "2018-02-16T21:21:30.467694Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/",
  "file_access_url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/file/",
  "id": "442a2904",
  "label": "sample_doc_name",
  "tags": ["sample_tag"],
  "glynt_tags": [],
  "content_type": "application/pdf",
  "content_md5":"4DujaMxdUy64mWOWbP6Xew=="
}

If you wish to include a content_md5 for increased data integrity guarantees, the first step is to generate the verification hash for the file you are going to upload. This is achieved by first generating an MD5 binary digest and then encoding it as a Base64 string. The MD5 digest, before being encoded as Base64, is a binary entity; do not convert it to a hexadecimal representation.

Below is an example of a shell command which combines these two steps:

openssl dgst -md5 -binary /path/to/some/file.pdf | openssl enc -base64

This returns the Base64 encoded string to use for the content_md5. Something like the following:

4DujaMxdUy64mWOWbP6Xew==

This value may now be included in the content_md5 field of the POST request, and will be used to guarantee that the file the API receives is the file you sent.

The recommended method of creating a Document is to include the Base64-encoded Document content directly in the POST request made to this endpoint.

This method requires no subsequent call to upload the Document content, because the content is included in the POST request. If you have already integrated the legacy Document upload method, see the Create a Document (Legacy) section, below, for that documentation.

If you wish to have stronger guarantees that the received Document is not corrupted in transit, you may optionally include the content_md5 property when creating a Document. See the Request Parameters table, below, for more information.

The GLYNT code examples git repo includes the upload_document.py example Python script, demonstrating this upload method.

The allowed content types for files are listed below:

HTTP Request

POST <datapool_url>/documents/

Request Body Parameters

Parameter Default Description
content_md5 Autocalculated The Base64-encoded 128-bit MD5 digest of the file content according to RFC 1864. If provided, it will be used to validate the uploaded content; the POST will be rejected and the Document not created if the provided value does not match what the API calculates from the content. Use it to guarantee that the document data is not corrupted during transit or decryption.
content_type None Required. The content type of the file. See the allowed content types list above.
label A UUID A string label for the Document. 255 character maximum. See the Labels, Tags & GLYNT Tags section. Unlike most labels in the API, this label does not need to be unique, though it is encouraged to use a unique, semantic label for every Document. Most often, this is the filename of the Document.
content None Required. The Base64-encoded content of the file.
tags [] A tags list. See the Labels, Tags & GLYNT Tags section.
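Putting the table together, the request body for the single-call upload can be assembled like this (a sketch; document_payload is our name, not part of the API):

```python
import base64

def document_payload(content_bytes, content_type, label=None, tags=None):
    """Build the JSON body for the single-call POST to /documents/."""
    body = {
        "content_type": content_type,  # required
        "content": base64.b64encode(content_bytes).decode("ascii"),  # required
    }
    if label is not None:
        body["label"] = label  # otherwise the API assigns a UUID
    if tags:
        body["tags"] = tags    # defaults to []
    return body
```

The optional content_md5 field can be added to this body for the integrity check described above.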

Create a Document (Legacy)

To create a Document using the legacy creation method, the first step is to generate the verification hash for the file you are going to upload. This is achieved by first generating an MD5 binary digest and then encoding it as a Base64 string. The MD5 digest, before being encoded as Base64, is a binary entity; do not convert it to a hexadecimal representation.

Below is an example of a shell command which combines these two steps:

openssl dgst -md5 -binary /path/to/some/file.pdf | openssl enc -base64

This returns the Base64 encoded string to use for the content_md5. Something like the following:

4DujaMxdUy64mWOWbP6Xew==

Next, POST to the GLYNT API to create the Document instance. Notice that the file content is not uploaded at this time and the Content-Type header still refers to the 'application/json' content type of the request, while the content_type key in the request data itself refers to the Document file type.

curl --request POST \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"label":"sample_doc_name","tags":["sample_tag"],"content_type":"application/pdf","content_md5":"4DujaMxdUy64mWOWbP6Xew=="}'

On success, this command will return a 201 response with a JSON body structured like this. Notice that the unique file_upload_url key is present.

{
  "created_at": "2018-02-16T21:21:30.467694Z",
  "updated_at": "2018-02-16T21:21:30.467694Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/",
  "file_access_url": "",
  "id": "442a2904",
  "label": "sample_doc_name",
  "tags": ["sample_tag"],
  "glynt_tags": [],
  "content_type": "application/pdf",
  "content_md5":"4DujaMxdUy64mWOWbP6Xew==",
  "file_upload_url": "https://files.glynt.ai?signature=abc123def456"
}

With the file_upload_url in hand, you can now upload the file content itself. Remember: no Authorization header should be present on this second request, because this is a presigned URL. The content-type and content-md5 headers must match the values provided in the JSON data of your initial POST request, which describe the file being uploaded. Notice that in this cURL example, we use the --upload-file flag, which causes cURL to include the document content as the body of the request.

curl --request PUT \
     --header "content-type: application/pdf" \
     --header "content-md5: 4DujaMxdUy64mWOWbP6Xew==" \
     --upload-file  "/some/local/file" \
     --url "https://files.glynt.ai?signature=abc123def456"

On success, this will return a 2xx status code (the exact code can vary) and may or may not return a response body. The Document is now created and uploaded.

The legacy method of uploading a document is outlined below. This method is not recommended for new integrations. The documentation is retained here for users with existing integrations.

To create a Document using the legacy method, you use the same endpoint as the standard Document upload method. However, in the legacy method, do not include the Document content in the initial call. Using the legacy upload method, content_md5 is a mandatory field on the initial POST request.

Following this flow, a successful POST request will return a file_upload_url. This URL is valid for 10 minutes, during which time you must upload the raw Document content bytes to the file_upload_url as the body of a PUT request. You must include the Content-Type and Content-MD5 headers on this PUT request, and they must match the corresponding values provided during the POST step.

Once the 10 minute window expires, or if you uploaded the content directly in the initial request, the content of the file can never be changed. If no content was uploaded, then the Document instance is worthless, and can be deleted. Because of this, it is recommended that you always upload the file content promptly after the initial request.

Our YouTube channel includes a video tutorial demonstrating how to upload a Document with the legacy approach using Postman. The GLYNT code examples git repo includes the legacy_upload_document.py example Python script, also demonstrating the legacy upload method.

HTTP Request

POST <datapool_url>/documents/

Request Body Parameters

Parameter Default Description
content_md5 None Required in the legacy upload method. Unlike the standard Document upload method, this content_md5 will be stored - but not validated - on POST. Instead, the content_md5 will be used to validate the uploaded content during the subsequent PUT.
content_type None Same as standard Create a Document.
label A UUID Same as standard Create a Document.
content None May not be passed in the legacy upload method. If it is passed, the system will assume you are attempting the standard Document upload method.
tags [] A tags list. See the Labels, Tags & GLYNT Tags section.

Retrieve a Document

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/"

If the Document exists, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2018-02-16T21:21:30.467694Z",
  "updated_at": "2018-02-16T21:21:30.467694Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/",
  "file_access_url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/file/",
  "id": "442a2904",
  "label": "sample_doc_name",
  "tags": ["awesome"],
  "glynt_tags": [],
  "content_type": "application/pdf",
  "content_md5":"4DujaMxdUy64mWOWbP6Xew=="
}

Returns detailed information about a given Document.

HTTP Request

GET <datapool_url>/documents/<document_id>/

Retrieve Document Content

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a2904/file/"

This returns a temporary url which can be used to access the file content directly.

{
  "file_temp_url": "https://files.glynt.ai/442a2904?signature=123abc"
}

This URL may now be used to directly access the file content for 1 hour. Remember, no Authorization header is necessary because this is a presigned URL. Note that only GET requests may be executed with this URL.

curl --url "https://files.glynt.ai/442a2904?signature=123abc"

Retrieve a temporary file url which can be used to directly access file content for up to 1 hour. You can only read the file content with this URL - you cannot modify or delete it.

HTTP Request

GET <datapool_url>/documents/<document_id>/file/

Change Document Properties

curl --request PATCH \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a27b0/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"tags":["sample_tag","advanced"]}'

On success, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2018-02-16T21:21:30.467694Z",
  "updated_at": "2018-02-16T23:22:24.103289Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a27b0/",
  "file_access_url": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a27b0/file/",
  "id": "442a27b0",
  "label": "sample_doc_name",
  "tags": ["sample_tag","advanced"],
  "glynt_tags": [],
  "content_type": "application/pdf",
  "content_md5":"4DujaMxdUy64mWOWbP6Xew=="
}

Change the mutable properties of a Document. The mutable properties are listed in the Request Body Parameters below.

HTTP Request

PATCH <datapool_url>/documents/<document_id>/

Request Body Parameters

Parameter Description
label See Create a Document request body parameters.
tags See Create a Document request body parameters.

Delete a Document

curl --request DELETE \
     --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/documents/442a27b0/"

On success, this command will return a 204 response with no JSON body.

Removes a Document and its associated data, including Extractions and Ground Truths. The Document is also removed from any associated Extraction Batches and Training Sets.

It is removed from Training Revisions as well, but those Training Revisions continue to function because the underlying model has already been created; all that is lost is the historical record of which Documents were part of the Training Revision.

HTTP Request

DELETE <datapool_url>/documents/<document_id>/

Training Sets

Training Sets Overview

A Training Set is a collection of Documents brought together into a shared workspace so they can be worked on as a group. A Document can be part of any number of Training Sets.

Training Sets are created by your GLYNT representative and are not editable through the API at this time. When you provide training Documents and the list of fields you would like extracted from them to your GLYNT representative, the Documents are uploaded for you and Training Sets are created to extract that data. The GLYNT AI learns to extract the data you want from the Documents you provide.

In order to maximize accuracy, each Training Set is created to extract data from a specific class of Documents. For example, if you provide a collection of Documents for training which are from two different publishers, two Training Sets will be created - one for publisher A and one for publisher B. Each Training Set has a unique label and description, which can be used to differentiate between the Training Sets.

If you wish to delete Training Sets, please contact your GLYNT representative.

Retrieve all Training Sets

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/"

This command will return a 200 response with a JSON body structured like this:

{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "created_at": "2019-01-16T20:24:21.467694Z",
      "updated_at": "2019-01-16T20:24:21.467694Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/3f708802/",
      "id": "3f708802",
      "label": "Electricity Company Inc.",
      "description": "Training Set for extracting data from Electricity Company Inc. invoices.",
      "documents": [
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4de1ca72/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/54abcb1e/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5a3fb342/"
      ],
      "glynt_tags": []
    },
    {
      "created_at": "2019-01-16T21:21:30.467694Z",
      "updated_at": "2019-01-16T21:24:31.645855Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/5f8d2c80/",
      "id": "5f8d2c80",
      "label": "Gas Company LLC",
      "description": "A Training Set which extracts data from Gas Company, LLC natural gas bills.",
      "documents": [
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/6b73b082/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/709523f2/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/778ba064/"
      ],
      "glynt_tags": []
    }
  ]
}

Lists all Training Sets in the Data Pool.

HTTP Request

GET <datapool_url>/training-sets/

Filtering

In addition to the filters specified in Filtering section, this endpoint supports a trained=true filter which, when applied, restricts results to Training Sets that have been successfully trained. Passing any value other than true (case sensitive) filters out all Training Sets (returns empty results).

Retrieve a Training Set

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/3f708802/"

If the Training Set exists, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2019-01-16T20:24:21.467694Z",
  "updated_at": "2019-01-16T20:24:21.467694Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/3f708802/",
  "id": "3f708802",
  "label": "Electricity Company Inc.",
  "description": "Training Set for extracting data from Electricity Company Inc. invoices.",
  "documents": [
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4de1ca72/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/54abcb1e/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5a3fb342/"
  ],
  "glynt_tags": []
}

Returns detailed information about a given Training Set.

HTTP Request

GET <datapool_url>/training-sets/<training_set_id>/

Extractions

Extractions Overview

An Extraction is an extraction job for a single Document. Extractions may be created directly, or automatically by an Extraction Batch. Each Extraction has its own status and finished property, independent of any parent Extraction Batch, if applicable.

Extractions can be run in classification mode using the classify field, a boolean indicating that the Extraction will engage the classification process. When running an Extraction in classify mode, no Training Set may be given; the system will evaluate and select the best Training Set to use from the same Data Pool. When the classifying Extraction completes, the system populates the training_set and fields properties automatically.
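The mutual exclusion between classify and training_set can be expressed as a small client-side check. This is a sketch only; the server enforces these rules on POST, and the error messages are illustrative.

```python
def validate_extraction_payload(payload: dict) -> None:
    """Raise if classify and training_set are combined, or both are missing."""
    if payload.get("classify") and "training_set" in payload:
        raise ValueError("a classify Extraction may not specify a training_set")
    if not payload.get("classify") and "training_set" not in payload:
        raise ValueError("training_set is required when classify is not set")
```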

All possible status values are listed in the table below.

Status Meaning
Pending Extraction has not yet started processing.
In Progress Extraction is in progress.
Verifying Extraction result is being verified.
Success Extraction finished processing.
Failed Extraction finished with an error. No data was extracted.

Sample result for a single Field, called Billing_Month. Notice that it is composed of two tokens: November and 2018.

"Billing_Month": {
  "content": "November 2018",
  "tokens": [
    {
      "content": "November",
      "page": 1,
      "bbox": [
        {"x": 1410, "y": 55},
        {"x": 1644, "y": 55},
        {"x": 1644, "y": 92},
        {"x": 1410, "y": 92}
      ]
    },
    {
      "content": "2018",
      "page": 1,
      "bbox": [
        {"x": 1650, "y": 55},
        {"x": 1720, "y": 55},
        {"x": 1720, "y": 92},
        {"x": 1650, "y": 92}
      ]
    }
  ]
}

Successful Extractions have a results property containing the extraction results. This property is omitted when retrieving all Extractions, so use the Retrieve an Extraction endpoint to view the results.

Results are in JSON format, where the properties are the Fields as defined by the Training Set, and the values are the extracted content of the field, as well as useful metadata about the extracted content. The results properties are explained in the following table:

Parameter Description
content The extracted string content for the field. If GLYNT did not extract data for this field, this parameter's value will be null. If the system can confidently assert that the Field is not present in the Document, then the value will be an empty string.
tokens The tokens (see Definitions) of the document which were used to construct the content. The value of this property is an array of token objects; their sub-properties are listed below. Note that tokens are not always available, in which case this key's value is an empty array.
tokens--content String content of the token as it was captured from the document.
tokens--page On which page the token appears.
tokens--bbox Bounding box coordinates of the token as it appears on the page.
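Reading the structure above, a small helper can distinguish the three content cases and rebuild a field's text from its tokens. This is a sketch against the Billing_Month sample; the bbox values are elided for brevity.

```python
def describe_field(field: dict) -> str:
    """Interpret a single field result per the content semantics above."""
    content = field["content"]
    if content is None:
        return "no data extracted"
    if content == "":
        return "field confidently absent from the document"
    return "extracted: " + content

billing_month = {
    "content": "November 2018",
    "tokens": [
        {"content": "November", "page": 1, "bbox": []},  # bbox elided
        {"content": "2018", "page": 1, "bbox": []},
    ],
}

# In this sample, joining token contents with spaces reproduces the content.
assert " ".join(t["content"] for t in billing_month["tokens"]) == "November 2018"
```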

Retrieve all Extractions

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/"

This command will return a 200 response with a JSON structured like this:

{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "created_at": "2019-01-16T20:24:25.120938Z",
      "updated_at": "2019-01-16T20:25:59.999881Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/e923b69a/",
      "id": "e923b69a",
      "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/774554c8/",
      "extraction_batch": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/f108e010/",
      "document": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/f8796b4e/",
      "status": "Success",
      "finished": true,
      "classify": false,
      "tags": [],
      "glynt_tags": []
    },
    {
      "created_at": "2019-01-16T20:24:26.823493Z",
      "updated_at": "2019-01-16T20:24:26.823493Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/03c89b3c/",
      "id": "03c89b3c",
      "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/774554c8/",
      "extraction_batch": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/f108e010/",
      "document": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/ff64f6bc/",
      "status": "In Progress",
      "finished": false,
      "classify": false,
      "tags": [],
      "glynt_tags": []
    }
  ]
}

Lists all Extractions in the Data Pool. The results property is omitted from each Extraction; to see the results, retrieve the individual Extraction.

HTTP Request

GET <datapool_url>/extractions/

Create an Extraction

curl --request POST \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"training_set":"89660ffa","document":"4c543d94"}'

On success, this command will return a 201 response with a JSON body structured like this:

{
  "created_at": "2018-02-17T14:54:30.699864Z",
  "updated_at": "2018-02-17T14:54:30.699864Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/e923b69a/",
  "id": "e923b69a",
  "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/774554c8/",
  "extraction_batch": null,
  "document": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
  "status": "Pending",
  "finished": false,
  "classify": false,
  "tags": [],
  "glynt_tags": []
}

To create an Extraction, first select a Training Set to power the extraction of data and upload the Document whose data you wish to extract. Then submit the Extraction with a POST request, passing the Training Set ID and the ID of the Document to extract data from.

If you set classify to true in the request data, you may not select a Training Set as the system will do this for you.

The status property will change as the Extraction is processed. The finished status will be updated when processing has completed.
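One way to wait on that transition is to poll the Extraction until finished is true. A sketch: fetch_extraction stands in for a GET on the Extraction URL, and the interval and retry cap are arbitrary choices.

```python
import time

def wait_for_extraction(fetch_extraction, interval_seconds=5, max_polls=120):
    """Poll until the Extraction reports finished, then return its final state.

    fetch_extraction: a callable returning the Extraction JSON as a dict,
    e.g. a wrapper around GET <datapool_url>/extractions/<extraction_id>/.
    """
    for _ in range(max_polls):
        extraction = fetch_extraction()
        if extraction["finished"]:
            return extraction  # status will be Success or Failed
        time.sleep(interval_seconds)
    raise TimeoutError("Extraction did not finish within the polling window")
```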

HTTP Request

POST <datapool_url>/extractions/

Request Body Parameters

Parameter Default Description
training_set None Required if classify is False. ID of the Training Set that will power the extraction.
document None Required. ID of the Document to extract data from.
classify False Set to True to allow automatic document classification to occur. If True, you may not provide a training_set.
tags [] A tags list. See the Labels, Tags & GLYNT Tags section.

Retrieve an Extraction

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/e923b69a/"

If the Extraction exists, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2019-01-16T20:24:25.120938Z",
  "updated_at": "2019-01-16T20:25:59.999881Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/e923b69a/",
  "id": "e923b69a",
  "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/774554c8/",
  "extraction_batch": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/f108e010/",
  "document": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/f8796b4e/",
  "status": "Success",
  "finished": true,
  "classify": false,
  "results": {
    "Billing_Month": {
      "content": "November 2018",
      "tokens": [
        {
          "content": "November",
          "page": 1,
          "bbox": [
            {"x": 1410, "y": 55},
            {"x": 1644, "y": 55},
            {"x": 1644, "y": 92},
            {"x": 1410, "y": 92}
          ]
        },
        {
          "content": "2018",
          "page": 1,
          "bbox": [
            {"x": 1650, "y": 55},
            {"x": 1720, "y": 55},
            {"x": 1720, "y": 92},
            {"x": 1650, "y": 92}
          ]
        }
      ],
      "tags": ["field_tags"],
      "glynt_tags": ["field_glynt_tags"]
    }
  }
}

Returns detailed information about a given Extraction. The results dictionary contains the results for each field, including the content, token data, and the tags and GLYNT Tags that were present on the field at the time it was trained.

HTTP Request

GET <datapool_url>/extractions/<extraction_id>/

Change Extraction Properties

curl --request PATCH \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/4427b0/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"tags":["orig_tag","new_tag"]}'

On success, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2018-02-17T14:54:30.699864Z",
  "updated_at": "2018-02-17T14:54:30.699864Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/4427b0/",
  "id": "4427b0",
  "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/774554c8/",
  "extraction_batch": null,
  "document": "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
  "status": "Pending",
  "finished": false,
  "classify": false,
  "tags": ["orig_tag", "new_tag"],
  "glynt_tags": []
}

Change the mutable properties of an Extraction. The mutable properties are listed in the Request Body Parameters below.

HTTP Request

PATCH <datapool_url>/extractions/<extraction_id>/

Request Body Parameters

Parameter Description
tags See Create an Extraction request body parameters.

Delete an Extraction

curl --request DELETE \
     --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/e923b69a/"

On success, this command will return a 204 response with no JSON body.

Removes an Extraction and associated data.

HTTP Request

DELETE <datapool_url>/extractions/<extraction_id>/

Extraction Batches

Extraction Batches Overview

Extraction Batches provide a convenient way to create and group together a collection of Extractions. See the Create an Extraction Batch section for the maximum number of Documents allowed in a single batch. As in the typical workflow discussed above, to create an Extraction Batch you send a POST request with a list of Document IDs to create Extractions for. You may also pass the ID of the Training Set to extract against. A series of Extractions is then created automatically.

Extraction Batches can be run in classification mode using the classify field, a boolean indicating that the Extraction Batch will engage the classification process. When running a Batch in classify mode, no Training Set may be given; the system will evaluate and select the best Training Set for each Document from the same Data Pool. For this reason the batch's training_set field remains null, as no single Training Set can be associated with the batch. As the child classifying Extractions complete, the system populates each one's training_set property automatically.

An Extraction Batch will inform you of the status of the batch, as well as the boolean finished status. It also links to each of the Extractions it creates, and you can retrieve the status values and results of those individual Extractions as they become available if you do not want to wait for the entire Extraction Batch to complete.

All possible statuses are listed in the table below.

Status Meaning
Pending No Extraction of the Batch has yet started processing.
In Progress At least one Extraction of the Batch is in progress.
Verifying All Extractions of the Batch are either Verifying or finished.
Success Batch finished processing. At least one child Extraction successfully completed.
Failed Batch finished processing. All child extractions failed.
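As an illustration only (the server computes the batch status), the table can be restated as a rollup over the child Extraction statuses:

```python
def batch_status(child_statuses):
    """Derive a batch status from child Extraction statuses, per the table above."""
    finished = {"Success", "Failed"}
    if all(s == "Pending" for s in child_statuses):
        return "Pending"
    if all(s in finished for s in child_statuses):
        return "Success" if "Success" in child_statuses else "Failed"
    if all(s in finished or s == "Verifying" for s in child_statuses):
        return "Verifying"
    return "In Progress"
```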

Retrieve all Extraction Batches

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/"

This command will return a 200 response with a JSON structured like this:

{
  "count": 2,
  "next": null,
  "previous": null,
  "results": [
    {
      "created_at": "2019-01-16T20:24:21.467694Z",
      "updated_at": "2019-01-16T20:34:00.100103Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/23d69fec/",
      "id": "23d69fec",
      "label": "December 2018 Invoices",
      "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/33010eda/",
      "documents": [
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/56d03d54/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5d2fb5da/"
      ],
      "extractions": [
        "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/60a8332c/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/6dd93e88/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/73847cc6/"
      ],
      "status": "Success",
      "finished": true,
      "classify": false,
      "tags": [],
      "glynt_tags": []
    },
    {
      "created_at": "2019-01-16T20:26:11.467752Z",
      "updated_at": "2019-01-16T20:28:17.666631Z",
      "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/40f584bc/",
      "id": "40f584bc",
      "label": "Special order forms",
      "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/33010eda/",
      "documents": [
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/56d03d54/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5d2fb5da/"
      ],
      "extractions": [
        "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/6dd93e88/",
        "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/73847cc6/"
      ],
      "status": "In Progress",
      "finished": false,
      "classify": false,
      "tags": [],
      "glynt_tags": []
    }
  ]
}

Lists all Extraction Batches in the Data Pool.

HTTP Request

GET <datapool_url>/extraction-batches/

Create an Extraction Batch

curl --request POST \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"label":"December 2018 Invoices","training_set":"89660ffa","documents":["4c543d94","56d03d54","5d2fb5da"]}'

On success, this command will return a 201 response with a JSON body structured like this:

{
  "created_at": "2018-02-17T14:54:30.699864Z",
  "updated_at": "2018-02-17T14:54:30.699864Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/b5a42d68/",
  "id": "b5a42d68",
  "label": "December 2018 Invoices",
  "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/89660ffa/",
  "documents": [
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/56d03d54/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5d2fb5da/"
  ],
  "extractions": [],
  "status": "Pending",
  "finished": false,
  "classify": false,
  "tags": [],
  "glynt_tags": []
}

To create an Extraction Batch, first select a Training Set to power the extraction of data and upload the Documents whose data you wish to extract. Then submit the Extraction Batch with a POST request, passing the Training Set ID and the IDs of the Documents to extract data from.

If you set classify to true in the request data, you may not select a Training Set as the system will do this for you for each Document.

The status property will change as the batch is processed. Extractions are created automatically shortly after the Extraction Batch itself is created. The finished status will be updated when processing has completed.

HTTP Request

POST <datapool_url>/extraction-batches/

Request Body Parameters

Parameter Default Description
documents None Required. List of Document IDs to extract data from. Minimum of 1, maximum of 1500 Document IDs.
label A UUID A string label for the Extraction Batch. See the Labels, Tags & GLYNT Tags section. Must be unique within a Data Pool.
tags [] A tags list. See the Labels, Tags & GLYNT Tags section.
training_set None Required if classify is False. ID of the Training Set that will power the extractions.
classify False Set to True to allow automatic document classification to occur. If true, you may not provide a training_set.
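Since a single batch accepts at most 1500 Document IDs, a larger collection must be split across multiple batches client-side. A minimal sketch of that chunking:

```python
def chunk_documents(document_ids, max_per_batch=1500):
    """Split Document IDs into payload-sized chunks, one per Extraction Batch."""
    return [document_ids[i:i + max_per_batch]
            for i in range(0, len(document_ids), max_per_batch)]
```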

Retrieve an Extraction Batch

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/23d69fec/"

If the Extraction Batch exists, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2019-01-16T20:24:21.467694Z",
  "updated_at": "2019-01-16T20:34:00.100103Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/23d69fec/",
  "id": "23d69fec",
  "label": "December 2018 Invoices",
  "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/33010eda/",
  "documents": [
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/56d03d54/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5d2fb5da/"
  ],
  "extractions": [
    "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/60a8332c/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/6dd93e88/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/73847cc6/"
  ],
  "status": "Success",
  "finished": true,
  "classify": false,
  "tags": [],
  "glynt_tags": []
}

Returns detailed information about a given Extraction Batch.

HTTP Request

GET <datapool_url>/extraction-batches/<extraction_batch_id>/

Change Extraction Batch Properties

curl --request PATCH \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/4427b0/" \
     --header "Authorization: Bearer abc.123.def" \
     --header "content-type: application/json" \
     --data '{"label": "New Label"}'

On success, this command will return a 200 response with a JSON body structured like this:

{
  "created_at": "2019-01-16T20:24:21.467694Z",
  "updated_at": "2019-01-16T20:34:00.100103Z",
  "url": "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/4427b0/",
  "id": "4427b0",
  "label": "New Label",
  "training_set": "https://api.glynt.ai/v6/data-pools/pRvt5/training-sets/33010eda/",
  "documents": [
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/4c543d94/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/56d03d54/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/documents/5d2fb5da/"
  ],
  "extractions": [
    "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/60a8332c/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/6dd93e88/",
    "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/73847cc6/"
  ],
  "status": "Success",
  "finished": true,
  "classify": false,
  "tags": [],
  "glynt_tags": []
}

Change the mutable properties of an Extraction Batch. The mutable properties are listed in the Request Body Parameters below.

HTTP Request

PATCH <datapool_url>/extraction-batches/<extraction_batch_id>/

Request Body Parameters

Parameter Description
label See Create an Extraction Batch request body parameters.
tags See Create an Extraction Batch request body parameters.

Delete an Extraction Batch

curl --request DELETE \
     --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/27b0-20e5-11e9-ab14-d663bd873d93/"

On success, this command will return a 204 response with no JSON body.

Removes an Extraction Batch.

HTTP Request

DELETE <datapool_url>/extraction-batches/<extraction_batch_id>/

Retrieve Extraction Batch Field Distribution Information

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/639c2be0/field-distribution"

On success, this command will return a 200 response with a JSON body structured like this:

{
  "document_count": 2,
  "fields": {
    "Account Number": {
      "responses": 2,
      "response_rate": 100
    },
    "Line Item One": {
      "responses": 1,
      "response_rate": 50
    }
  },
  "documents": {
    "hg8rr1": {
      "label": "Sample Document 1",
      "training_set_id": "8fnadk",
      "training_set_label": "Sample Training Set",
      "review_url": "http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/defgG1",
      "fields": {
        "Account Number": {
          "content": "abc123",
          "tags": ["Sample Tag 1", "Sample Tag 2"],
          "glynt_tags": []
        }
      }
    },
    "nQkaL1": {
      "label": "Sample Document 2",
      "training_set_id": null,
      "training_set_label": null,
      "review_url": "http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/hg8rr1",
      "fields": {
        "Account Number": {
          "content": "def456",
          "tags": ["Sample Tag 1", "Sample Tag 2"],
          "glynt_tags": []
        },
        "Line Item One": {
          "content": "$100",
          "tags": [],
          "glynt_tags": ["Sample Glynt Tag"]
        }
      }
    }
  }
}

Retrieve the stats CSV report with the following request:

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/639c2be0/field-distribution?format=csv"

On success, this command will return a 200 response with a CSV body containing field distribution stats:

Field Label, Responses, Response Rate, Total Documents
Account Number, 2, 100, 2
Line Item One, 1, 50, 2
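The stats report is plain CSV and can be consumed with a standard parser. A sketch using Python's csv module; skipinitialspace absorbs the space after each comma in the sample above.

```python
import csv
import io

def parse_stats_report(csv_text: str) -> dict:
    """Map each field label to its counts from the stats CSV report."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return {
        row["Field Label"]: {
            "responses": int(row["Responses"]),
            "response_rate": int(row["Response Rate"]),
            "total_documents": int(row["Total Documents"]),
        }
        for row in reader
    }

sample = """Field Label, Responses, Response Rate, Total Documents
Account Number, 2, 100, 2
Line Item One, 1, 50, 2
"""
```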

Retrieve the "wide" CSV report with the following request:

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/639c2be0/field-distribution?format=csv&report_type=wide"

On success, this command will return a 200 response with a CSV body containing field distribution data organized by document:

Training Set Id, Training Set Label, Document Id, Document Label, Account Number, Line Item One, Document Review URL
dsank1, Sample Training Set, hg8rr1, Sample Document 1, abc123, , http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/defgG1
, , nQkaL1, Sample Document 2, def456, $100, http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/hg8rr1

Retrieve the "long" CSV report with the following request:

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/639c2be0/field-distribution?format=csv&report_type=long"

On success, this command will return a 200 response with a CSV body containing field distribution data organized by document and field:

Training Set Id, Training Set Label, Document Id, Document Label, Field Label, Field Tags, Field Glynt Tags, Returned Value, Document Review URL
dsank1, Sample Training Set, hg8rr1, Sample Document 1, Account Number, ["Sample Tag 1", "Sample Tag 2"], [], abc123, http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/defgG1
dsank1, Sample Training Set, hg8rr1, Sample Document 1, Line Item One, [], ["Sample Glynt Tag"], , http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/defgG1
, , nQkaL1, Sample Document 2, Account Number, ["Sample Tag 1", "Sample Tag 2"], [], def456, http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/hg8rr1
, , nQkaL1, Sample Document 2, Line Item One, [], ["Sample Glynt Tag"], $100, http://ui.glynt.ai/data-pools/Dnais5/review/639c2be0/extraction/hg8rr1

Returns information about the distribution of fields in an Extraction Batch.

The training_sets query parameter limits the report to Extractions related to the given Training Sets. This is useful for a large Extraction Batch when you wish to generate smaller reports focused on certain groups of Documents.

HTTP Request

GET <datapool_url>/extraction-batches/<extraction_batch_id>/field-distribution

Query Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| format | json | Possible values: csv, json. If csv is requested, the response content-type will be application/csv and the raw CSV data will be returned. |
| report_type | stats | Possible values: long, wide, stats. This query parameter is ignored unless the format is csv. stats returns a summary view of the data. long returns a field-centric view of the data. wide returns a document-centric view of the data. |
| training_sets | all sets and "None" set | Comma-separated list of Training Set IDs. One of the elements of the list may be the special value None to include the Extractions with no Training Set. Only Extractions related to the given Training Sets will be included in the report. Similarly, only Fields from the included Extractions will be shown. If not passed, all Extractions in the batch are included in the report. |
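
As a worked example, the training_sets filter can be combined with the CSV report options above. A minimal sketch, composing the request URL with hypothetical Training Set and Extraction Batch IDs:

```shell
# Compose a field-distribution URL that limits the "long" csv report to
# one Training Set plus unclassified Extractions (the special value None).
# The IDs below are hypothetical.
url="https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/639c2be0/field-distribution"
query="format=csv&report_type=long&training_sets=dsank1,None"
echo "${url}?${query}"

# Retrieve the report with:
#   curl --header "Authorization: Bearer abc.123.def" --url "${url}?${query}"
```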

Classification (DEPRECATED)

Classification Overview

Classification generally describes the process of allowing the system to determine what "class" of Documents a given Document belongs to. In the context of the API, a class of Documents is any group of Documents from which data can be accurately extracted by a single Training Revision. The primary value of this functionality is to allow the extraction of data from Documents without the user having to specify which Training Revision to use for each Document. The system determines the class of each Document, and uses the best Training Revision to extract the field data from the Document.

Classifier Training Sets

Each Data Pool may have a single active "Classifier Training Set." The Classifier Training Set is created and maintained by staff users. Staff users may apply the Glynt Tag classifier:active_classifier to a single Training Set in the Data Pool. Any Training Set with this Glynt Tag is considered a Classifier Training Set. Contact your Glynt representative to create or modify your Classifier Training Set, or with any questions about it.

Classifier Extractions

Example Glynt Tags which would be present on a Classification Extraction:

{"glynt_tags": ["classifier:classifier_extraction"]}

Example Glynt Tags which would be present on a child Extraction:

{"glynt_tags": ["classifier:parent:yu7789"]}

Listing child Extractions of a specific Classifier Extraction:

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/?glynt_tag=classifier:parent:yu7789"

Extractions may be initiated in "Classification mode" by setting the training_set to the ID of the currently active Classifier Training Set; these are called Classifier Extractions. They automatically receive a Glynt Tag of classifier:classifier_extraction. When a Classifier Extraction successfully classifies a Document, a child Extraction is spawned which is responsible for the final data extraction. This child Extraction receives a Glynt Tag of classifier:parent:<id_of_classifier_extraction>. Filtering the Extractions list endpoint for this tag will allow you to find the child Extraction.
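
Initiating a Classifier Extraction might look as follows. This is only a sketch: the exact request body for creating an Extraction is documented in the Extractions section, and the Document and Training Set IDs below are hypothetical. The key point is that training_set is set to the ID of the active Classifier Training Set.

```shell
# Build the JSON body for an Extraction created in Classification mode.
# The body shape and IDs are hypothetical -- consult the Extractions
# section for the exact create schema.
body='{"document": "nQkaL1", "training_set": "dsank1"}'
echo "$body"

# Submit it with:
#   curl --request POST \
#        --header "Authorization: Bearer abc.123.def" \
#        --header "Content-Type: application/json" \
#        --data "$body" \
#        --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/"
```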

Example Glynt Tags which would be present on a Classification Extraction which failed to classify because of no matching classes:

{
  "glynt_tags": [
    "classifier:classifier_extraction",
    "classifier:failed_to_classify:no_matches"
  ]
}

When a Classifier Extraction fails to classify a Document, the Classifier Extraction will receive a Glynt Tag of classifier:failed_to_classify. No child Extraction will be spawned. The Documents of these Extractions will have to be manually classified by a human, and run through the appropriate Training Revision to have data extracted. This tag will sometimes have additional information attached to it, when available. The possible values are below, along with their meaning:

| tag | Description |
| --- | --- |
| classifier:failed_to_classify:too_many_classes | More than one class seems like a plausible match for this Document. |
| classifier:failed_to_classify:no_matches | No class seems like a plausible match for this Document. |
| classifier:failed_to_classify:unrecognized_class | A single class was identified, but no Training Set could be identified from the names of the fields. At least one Field name in the classifying set is incorrect. |
| classifier:failed_to_classify:recursive_classification | The identified class is the Classifier Training Set, which would cause infinite recursion of classification. |

In either case, classification should not be considered complete until the classifier:finished tag appears on the Classifier Extraction. This tag indicates that all classification processing is complete. The child Extraction may still be running to execute final data extraction, but all classification is complete.
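
Programmatically, completion can be detected by checking the Classifier Extraction's glynt_tags for this tag. A sketch using a hypothetical response body (in practice, first fetch the Extraction's JSON from the API, as indicated in the comment):

```shell
# Check a Classifier Extraction's Glynt Tags for classifier:finished.
# The response below is a hypothetical sample; fetch the real one with:
#   curl --silent --header "Authorization: Bearer abc.123.def" \
#        --url "https://api.glynt.ai/v6/data-pools/pRvt5/extractions/yu7789/"
response='{"glynt_tags": ["classifier:classifier_extraction", "classifier:finished"]}'
if echo "$response" | grep -q '"classifier:finished"'; then
  echo "classification complete"
else
  echo "still classifying"
fi
```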

Classifier Extraction Batches

Example Glynt Tags on a Classification Extraction Batch:

{"glynt_tags": ["classifier:classifier_extraction_batch"]}

Extraction Batches may be initiated in "Classification mode" the same way as Extractions. They are thus called Classifier Extraction Batches, and automatically receive a Glynt Tag of classifier:classifier_extraction_batch. These Extraction Batches will spawn Classifier Extractions, which will execute to completion as per the documentation above.

Example Glynt Tags on a child Extraction Batch:

{"glynt_tags": ["classifier:parent:an2lDS"]}

Once all Classifier Extractions complete and all of their respective child Extractions are created, child Extraction Batches will be spawned by the Classifier Extraction Batch, one for each class of Document present in the set of Documents successfully classified. The child Extractions will be automatically associated with these child Extraction Batches, thus grouping the Extractions by class. These child Extraction Batches will also receive the Glynt Tag classifier:parent:<id_of_classifier_extraction_batch>.

Listing child Extraction Batches of a specific Classifier Extraction Batch:

curl --header "Authorization: Bearer abc.123.def" \
     --url "https://api.glynt.ai/v6/data-pools/pRvt5/extraction-batches/?glynt_tag=classifier:parent:an2lDS"

The Extraction Batches list endpoint can be filtered for this tag to find all the classes which were identified during the run of the Classifier Extraction Batch, and the child Extraction Batches can be used to investigate the results of the run, segmented by class. These child Extraction Batches have all the functionality of normal Extraction Batches, including the reporting endpoints.

Classifier Extraction Batches should not be considered complete until the classifier:finished tag appears on the Classifier Extraction Batch. This tag indicates that all classification processing is complete. Child Extraction Batches may still be running to execute the final data extraction, but all classification is complete.