# Extractly API Documentation for LLMs

## Overview

Extractly is a document extraction API that uses AI to extract structured data from PDFs and other documents based on customizable templates. The API supports template-based extraction where you define the fields you want to extract, and the AI processes documents to return structured JSON data.

**Base URL**: https://www.extractly.dev

**API Version**: V2 (current)

---

## Authentication

All API endpoints require authentication using an API key in the `Authorization` header:

```
Authorization: Bearer ext_your_api_key_here
```

### Getting an API Key

1. Sign up at https://www.extractly.dev
2. Navigate to the dashboard
3. Go to API Keys section
4. Create a new API key
5. Copy the key (it starts with `ext_`)

---

## Core Concepts

### Templates

Templates define the structure of data you want to extract from documents. Each template contains:

- **name**: Descriptive name for the template
- **description**: Optional description

### Field Definitions

Each field in a template has:

- **name**: Field identifier (e.g., `invoice_number`, `person_name`)
- **type**: Data type - `string`, `number`, `date`, `boolean`, or `array`
- **description**: Optional description of what this field contains
- **required**: Boolean indicating if this field must be present
- **aiInstructions**: Optional specific instructions for AI extraction

### Jobs

Extraction is asynchronous. When you submit a document:

1. You receive a `jobId` immediately
2. Processing happens in the background (typically 30-60 seconds)
3. You poll the status endpoint to check progress
4. When complete, the result contains extracted data

### Credits

- Template extractions cost **10 credits** per document
- Check your credit balance before extracting
- Insufficient credits will result in a 402 error

---

## API Endpoints

### 1. Get Template Details

**GET** `/api/v2/templates/{templateId}`

Retrieve details of a specific template including its field definitions.

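
For callers not using cURL, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names `build_template_request` and `fetch_template_fields` are hypothetical, the 30-second timeout is an arbitrary choice, and the response shape is assumed to match the example response documented for this endpoint.

```python
import json
import urllib.request

BASE_URL = "https://www.extractly.dev"


def build_template_request(template_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one template."""
    url = f"{BASE_URL}/api/v2/templates/{template_id}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_template_fields(template_id: str, api_key: str) -> list:
    """Send the request and return the template's `fields` list.

    Assumes the body looks like {"template": {"fields": [...]}, ...},
    as in the documented example response.
    """
    request = build_template_request(template_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["template"]["fields"]
```

Keeping request construction separate from sending makes the URL and header logic easy to check without network access.
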
**Headers:**

```
Authorization: Bearer ext_your_api_key_here
```

**cURL Example:**

```bash
curl -X GET \
  https://www.extractly.dev/api/v2/templates/template_123 \
  -H "Authorization: Bearer ext_your_api_key_here"
```

**Response:**

```json
{
  "template": {
    "id": "template_123",
    "name": "Invoice Extractor",
    "description": "Extract key information from invoices",
    "extractionMethod": "GOOGLE_DOCUMENT_AI",
    "isDraft": false,
    "customerId": "customer_abc",
    "fields": [
      {
        "name": "invoice_number",
        "type": "string",
        "description": "Invoice number",
        "required": true
      },
      {
        "name": "total_amount",
        "type": "number",
        "description": "Total amount",
        "required": true
      },
      {
        "name": "invoice_date",
        "type": "date",
        "description": "Invoice date",
        "required": false
      }
    ],
    "createdAt": "2025-10-01T12:00:00.000Z",
    "updatedAt": "2025-10-01T12:00:00.000Z"
  },
  "version": "v2"
}
```

**Rate Limit:** 100 requests per minute

**Important:**

- Use this endpoint to inspect template structure before extraction
- The `fields` array defines what data will be extracted from documents
- Template management (create/update/delete) is done via the dashboard UI at https://www.extractly.dev/dashboard/templates

---

### 2. Extract Data from Document

**POST** `/api/v2/templates/{templateId}/extract`

Submit a document for extraction using a specific template.

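
This endpoint accepts either a multipart upload or a JSON body with base64 content, and it rejects missing or oversized files. The Python sketch below builds the JSON variant and runs those two checks locally before any credits are spent. The helper name `build_base64_payload` is hypothetical, and two points are assumptions the documentation does not state: that 10MB means 10 × 1024 × 1024 bytes, and that the limit applies to the raw file rather than the encoded string.

```python
import base64

# Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes of raw file data.
MAX_FILE_BYTES = 10 * 1024 * 1024


def build_base64_payload(
    file_bytes: bytes,
    file_name: str,
    mime_type: str = "application/pdf",
) -> dict:
    """Return the JSON body for the base64 variant of the extract endpoint."""
    if not file_bytes:
        raise ValueError("No file provided")
    if len(file_bytes) > MAX_FILE_BYTES:
        raise ValueError("File size too large. Maximum size is 10MB.")
    return {
        # b64encode returns bytes; the JSON body needs a plain string
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "fileName": file_name,
        "mimeType": mime_type,
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, which also sets the `Content-Type: application/json` header.
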
**Headers:**

```
Authorization: Bearer ext_your_api_key_here
Content-Type: multipart/form-data
```

**Form Data:**

- `file`: The PDF file to process (max 10MB)

**cURL Example:**

```bash
curl -X POST \
  https://www.extractly.dev/api/v2/templates/template_123/extract \
  -H "Authorization: Bearer ext_your_api_key_here" \
  -F "file=@invoice.pdf"
```

**Node.js Example:**

```javascript
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

// Wrapped in an async function because CommonJS has no top-level await
async function main() {
  const form = new FormData();
  form.append('file', fs.createReadStream('invoice.pdf'));

  const response = await axios.post(
    'https://www.extractly.dev/api/v2/templates/template_123/extract',
    form,
    {
      headers: {
        'Authorization': 'Bearer ext_your_api_key_here',
        ...form.getHeaders()
      }
    }
  );

  console.log(response.data);
}

main().catch(console.error);
```

**Python Example:**

```python
import requests

url = "https://www.extractly.dev/api/v2/templates/template_123/extract"
headers = {"Authorization": "Bearer ext_your_api_key_here"}

# Send the request while the file is still open
with open("invoice.pdf", "rb") as file:
    files = {"file": file}
    response = requests.post(url, headers=headers, files=files)

print(response.json())
```

**Alternative: JSON with Base64**

You can also send the file as base64-encoded JSON:

**Headers:**

```
Authorization: Bearer ext_your_api_key_here
Content-Type: application/json
```

**Request Body:**

```json
{
  "file": "base64_encoded_file_content",
  "fileName": "document.pdf",
  "mimeType": "application/pdf"
}
```

**Response (202 Accepted):**

```json
{
  "message": "Template extraction is processing...",
  "jobId": "job_abc123",
  "fileName": "invoice.pdf",
  "templateId": "template_123",
  "templateName": "Invoice Extractor",
  "customerId": "customer_abc",
  "version": "v2"
}
```

**Important:**

- Save the `jobId` to check extraction status
- Processing typically takes 30-60 seconds
- **Supported file formats depend on extraction method**:
  - **Document AI**: PDF only
  - **Raw Text Extraction**: PDF, Office files (Word, Excel, PowerPoint), images (PNG, JPG), HTML, and more
- Maximum file size: 10MB
- Costs 10 credits per extraction

**Error Responses:**

**400 Bad Request - No file:**

```json
{
  "error": "No file provided"
}
```

**400 Bad Request - Invalid file type:**

```json
{
  "error": "Only PDF files are allowed"
}
```

**400 Bad Request - File too large:**

```json
{
  "error": "File size too large. Maximum size is 10MB."
}
```

**402 Payment Required - Insufficient credits:**

```json
{
  "error": "Insufficient credits",
  "creditsRequired": 10,
  "creditsAvailable": 5
}
```

**404 Not Found - Template not found:**

```json
{
  "error": "Template not found"
}
```

**429 Too Many Requests - Rate limit exceeded:**

```json
{
  "error": "Rate limit exceeded"
}
```

**Rate Limit:** 50 requests per minute

---

### 3. Check Extraction Status

**GET** `/api/v2/status/{jobId}`

Check the processing status of an extraction job.

**Headers:**

```
Authorization: Bearer ext_your_api_key_here
```

**cURL Example:**

```bash
curl -X GET \
  https://www.extractly.dev/api/v2/status/job_abc123 \
  -H "Authorization: Bearer ext_your_api_key_here"
```

**Response - Processing:**

```json
{
  "jobId": "job_abc123",
  "status": "processing",
  "message": "Job is currently being processed.",
  "version": "v2",
  "createdAt": "2025-10-03T14:06:55.230Z",
  "updatedAt": "2025-10-03T14:07:30.123Z",
  "document": null,
  "template": {
    "id": "template_123",
    "name": "Invoice Extractor"
  }
}
```

**Response - Completed:**

```json
{
  "jobId": "job_abc123",
  "status": "completed",
  "message": "Job has completed successfully.",
  "version": "v2",
  "createdAt": "2025-10-03T14:06:55.230Z",
  "updatedAt": "2025-10-03T14:09:19.844Z",
  "document": {
    "id": "doc_xyz789",
    "fileName": "invoice.pdf"
  },
  "template": {
    "id": "template_123",
    "name": "Invoice Extractor"
  },
  "result": {
    "invoice_number": "INV-2025-001",
    "total_amount": "1250.50",
    "invoice_date": "2025-10-01"
  },
  "completedAt": "2025-10-03T14:09:19.843Z"
}
```

**Response - Failed:**

```json
{
  "jobId": "job_abc123",
  "status": "failed",
  "message": "Extraction failed: Invalid PDF format",
  "version": "v2",
  "createdAt": "2025-10-03T14:06:55.230Z",
  "updatedAt": "2025-10-03T14:07:00.123Z",
  "document": null,
  "template": {
    "id": "template_123",
    "name": "Invoice Extractor"
  },
  "error": "Invalid PDF format"
}
```

**Job Status Values:**

- `pending`: Job is queued for processing
- `processing`: Job is currently being processed
- `completed`: Job completed successfully (includes `result` field)
- `failed`: Job failed (includes `error` field)
- `cancelled`: Job was cancelled

**Important:**

- The `result` field structure depends on the template's field definitions
- Poll this endpoint every 2-5 seconds until the status is terminal: `completed`, `failed`, or `cancelled`
- Results are stored and can be retrieved later using the same `jobId`

**Rate Limit:** No explicit rate limit (reasonable polling recommended)

---

## Field Types Reference

When defining template fields, use these types:

| Type | Description | Example Values |
|------|-------------|----------------|
| `string` | Text data | "John Doe", "Invoice #12345" |
| `number` | Numeric data | 42, 3.14, 1000.50 |
| `date` | Date strings | "2025-10-03", "January 1, 2025" |
| `boolean` | True/false values | true, false |
| `array` | List of strings | ["item1", "item2", "item3"] |

---

## Error Handling

### HTTP Status Codes

- `200 OK`: Request successful
- `202 Accepted`: Extraction job accepted and processing
- `400 Bad Request`: Invalid request (missing file, invalid format)
- `401 Unauthorized`: Invalid or missing API key
- `402 Payment Required`: Insufficient credits
- `404 Not Found`: Template or job not found
- `409 Conflict`: Template name already exists or template in use
- `429 Too Many Requests`: Rate limit exceeded
- `500 Internal Server Error`: Server error

### Common Error Messages

```json
{
  "error": "Template not found"
}
```

```json
{
  "error": "Insufficient credits",
  "creditsRequired": 10,
  "creditsAvailable": 5
}
```

```json
{
  "error": "Rate limit exceeded"
}
```

---

## Rate Limits

| Endpoint | Limit |
|----------|-------|
| Get Template (GET /api/v2/templates/{id}) | 100 requests/minute |
| Extract Document (POST /api/v2/templates/{id}/extract) | 50 requests/minute |
| Check Status (GET /api/v2/status/{jobId}) | No explicit limit (reasonable polling) |

Rate limit information is included in response headers:

- `X-RateLimit-Limit`: Maximum requests allowed
- `X-RateLimit-Remaining`: Remaining requests in window
- `X-RateLimit-Reset`: Unix timestamp when limit resets

---

## Best Practices

### 1. Template Design

- **Be specific with field names**: Use descriptive names like `invoice_number` instead of `number`
- **Provide AI instructions**: Help the AI understand context with `aiInstructions`
- **Mark required fields**: Set `required: true` for critical fields
- **Use appropriate types**: Choose the correct field type for better validation

### 2. Error Handling

- Always check credit balance before bulk operations
- Implement retry logic for 429 (rate limit) errors
- Handle 402 (insufficient credits) by prompting for credit purchase
- Parse error messages for user-friendly feedback

### 3. Polling

- Poll status endpoint every 2-5 seconds
- Implement exponential backoff for long-running jobs
- Set a reasonable timeout (e.g., 2 minutes)
- Cache completed results to avoid re-polling

### 4. File Preparation

- Ensure PDFs are not password-protected
- Keep file sizes under 10MB
- Use high-quality scans (300 DPI recommended)
- Avoid heavily redacted or corrupted files

### 5. Security

- Store API keys securely (environment variables, secrets manager)
- Never expose API keys in client-side code
- Use HTTPS for all requests
- Rotate API keys periodically

---

## Example Workflows

### Complete Extraction Workflow

```javascript
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');

const API_KEY = 'ext_your_api_key_here';
const BASE_URL = 'https://www.extractly.dev';
const templateId = 'template_123';

async function extractDocument(filePath) {
  // 1. Get template details (optional)
  const templateResponse = await axios.get(
    `${BASE_URL}/api/v2/templates/${templateId}`,
    { headers: { Authorization: `Bearer ${API_KEY}` } }
  );
  console.log('Fields to extract:', templateResponse.data.template.fields);

  // 2. Submit document for extraction
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath));

  const extractResponse = await axios.post(
    `${BASE_URL}/api/v2/templates/${templateId}/extract`,
    form,
    {
      headers: {
        Authorization: `Bearer ${API_KEY}`,
        ...form.getHeaders()
      }
    }
  );
  const jobId = extractResponse.data.jobId;
  console.log('Job submitted:', jobId);

  // 3. Poll for results
  while (true) {
    const statusResponse = await axios.get(
      `${BASE_URL}/api/v2/status/${jobId}`,
      { headers: { Authorization: `Bearer ${API_KEY}` } }
    );
    const { status, result, error } = statusResponse.data;

    if (status === 'completed') {
      return result;
    }
    // Stop on every terminal status; a cancelled job would otherwise loop forever
    if (status === 'failed' || status === 'cancelled') {
      throw new Error(error || `Job ended with status: ${status}`);
    }

    console.log('Status:', status);
    await new Promise(resolve => setTimeout(resolve, 3000));
  }
}

// Use the function (inside an async function: CommonJS has no top-level await)
async function main() {
  const result = await extractDocument('invoice.pdf');
  console.log('Extracted data:', result);
  // Example output: { invoice_number: 'INV-2025-001', total_amount: '1250.50', invoice_date: '2025-10-01' }
}

main().catch(console.error);
```

### Batch Processing

```javascript
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');

const API_KEY = 'ext_your_api_key_here';
const BASE_URL = 'https://www.extractly.dev';
const templateId = 'template_123';

async function submitDocument(filePath) {
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath));

  const response = await axios.post(
    `${BASE_URL}/api/v2/templates/${templateId}/extract`,
    form,
    {
      headers: {
        Authorization: `Bearer ${API_KEY}`,
        ...form.getHeaders()
      }
    }
  );
  return response.data.jobId;
}

async function waitForJob(jobId) {
  while (true) {
    const response = await axios.get(
      `${BASE_URL}/api/v2/status/${jobId}`,
      { headers: { Authorization: `Bearer ${API_KEY}` } }
    );
    const { status, result, error } = response.data;

    if (status === 'completed') {
      return result;
    }
    // Stop on every terminal status; a cancelled job would otherwise loop forever
    if (status === 'failed' || status === 'cancelled') {
      throw new Error(error || `Job ${jobId} ended with status: ${status}`);
    }

    await new Promise(resolve => setTimeout(resolve, 3000));
  }
}

// Process multiple documents
async function main() {
  const files = ['invoice1.pdf', 'invoice2.pdf', 'invoice3.pdf'];

  // Submit all documents
  const jobIds = await Promise.all(files.map(file => submitDocument(file)));
  console.log('Submitted jobs:', jobIds);

  // Wait for all results
  const results = await Promise.all(jobIds.map(jobId => waitForJob(jobId)));
  console.log('All results:', results);
}

main().catch(console.error);
```

---

## Support

For API support and questions:

- Documentation: https://www.extractly.dev/dashboard/
- Email: hamed+extractly@finna.ai

---

## Changelog

### V2 (Current)

- Asynchronous job-based extraction
- Improved error handling
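
---

## Appendix: Polling with Backoff (Python Sketch)

The polling guidance in Best Practices (2-5 second intervals, exponential backoff, a timeout of about 2 minutes) can be combined into one loop. The sketch below is illustrative only: `wait_for_job` and its parameters are hypothetical names, the 1.5 growth factor is an arbitrary choice, and `fetch_status` stands in for whatever function performs the `GET /api/v2/status/{jobId}` call and returns the parsed JSON body.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    initial_delay: float = 2.0,   # lower end of the documented 2-5 s range
    max_delay: float = 5.0,       # upper end of the documented 2-5 s range
    backoff: float = 1.5,         # assumption: growth factor is not specified
    timeout: float = 120.0,       # the "2 minutes" suggested in Best Practices
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status and return its result."""
    delay = initial_delay
    waited = 0.0  # counts sleep time only, not time spent in requests
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job["result"]
        if status in ("failed", "cancelled"):
            raise RuntimeError(job.get("error") or f"Job ended with status: {status}")
        if waited + delay > timeout:
            raise TimeoutError(f"Job not finished after {waited:.0f}s of waiting")
        sleep(delay)
        waited += delay
        delay = min(delay * backoff, max_delay)
```

Passing `fetch_status` and `sleep` as arguments keeps the loop independent of any HTTP library and lets it be exercised with canned responses.
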