datalab-to/marker
Convert PDF to markdown + JSON quickly with high accuracy
Capabilities
Cost
Community model (estimated from hardware time)
Input Parameters
| Name | Type | Description | Default | Constraints |
|---|---|---|---|---|
| file (required) | string (uri) | Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp | — | — |
| additional_config | string | Advanced configuration options as JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker | — | — |
| block_correction_prompt | string | Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure' | — | — |
| disable_image_extraction | boolean | Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field | false | — |
| disable_ocr_math | boolean | Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX | false | — |
| force_ocr | boolean | Force OCR on all pages even if text is extractable. By default, Marker automatically uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output | false | — |
| format_lines | boolean | Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation | false | — |
| include_metadata | boolean | Include detailed metadata and JSON structure in the output. When enabled, returns json_data (hierarchical document structure with bounding boxes) and metadata (page stats, table of contents). When disabled (default), only returns markdown to reduce response size | false | — |
| max_pages | integer | Maximum number of pages to process. Cannot be specified if page_range is set; these parameters are mutually exclusive | — | min: 1 |
| mode | string | Processing mode affecting speed and quality. 'fast': lowest latency, preserves most positional information. 'balanced': same as using use_llm. 'accurate': highest quality, slowest, preserves least positional information | "fast" | one of: fast, balanced, accurate |
| page_range | string | Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set; these parameters are mutually exclusive | — | — |
| page_schema | string | Structured extraction: Provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (as a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: {"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50% | — | — |
| paginate | boolean | Add page separators to the output. Each page will be separated by a horizontal rule containing the page number in the format: \n\n{PAGE_NUMBER}\n{48 dashes}\n\n | false | — |
| save_checkpoint | boolean | Save processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows | false | — |
| segmentation_schema | string | JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker | — | — |
| skip_cache | boolean | Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results | false | — |
| strip_existing_ocr | boolean | Remove embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled | false | — |
| use_llm | boolean | Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Will increase processing time | false | — |
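Several constraints in the parameter table are easy to get wrong by hand: max_pages and page_range are mutually exclusive, mode accepts only three values, and page_schema and additional_config must be passed as JSON strings rather than objects. The following is a minimal sketch of a client-side helper that checks these before a request is sent. The function name `build_marker_input` is hypothetical, and how the resulting dict is submitted (HTTP client, SDK) is not covered by this page, so that step is left out.

```python
import json

VALID_MODES = ("fast", "balanced", "accurate")


def _as_json_string(value):
    """Return value unchanged if it is already a string, else JSON-encode it."""
    return value if isinstance(value, str) else json.dumps(value)


def build_marker_input(file_url, *, mode="fast", max_pages=None,
                       page_range=None, page_schema=None,
                       additional_config=None, **flags):
    """Assemble an input dict using the parameter names from the table.

    Hypothetical helper: it only validates and packages the values,
    it does not call any service.
    """
    if max_pages is not None and page_range is not None:
        raise ValueError("max_pages and page_range are mutually exclusive")
    if max_pages is not None and max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    payload = {"file": file_url, "mode": mode}
    if max_pages is not None:
        payload["max_pages"] = max_pages
    if page_range is not None:
        payload["page_range"] = page_range
    # Both fields are typed as strings in the table, so dicts are serialized.
    if page_schema is not None:
        payload["page_schema"] = _as_json_string(page_schema)
    if additional_config is not None:
        payload["additional_config"] = _as_json_string(additional_config)
    # Remaining keyword arguments are passed through, e.g. force_ocr=True.
    payload.update(flags)
    return payload


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}
example = build_marker_input(
    "https://example.com/invoice.pdf",  # placeholder URL
    page_range="0,2-4",
    page_schema=invoice_schema,
    additional_config={"disable_links": True},
    paginate=True,
)
```

Passing dicts and serializing them in one place avoids hand-escaping JSON inside a string, which is a common source of malformed page_schema values.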
60af7e72bef7 Updated: 2/26/2026 48.5K runs
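The page_range syntax described in the parameter table (comma-separated entries, each either a single page or an inclusive start-end range, zero-based as in the '0,2-4' example) can be previewed locally before submitting a job. This is a sketch of one way to expand such a string; `expand_page_range` is a hypothetical helper, and the service's own parser may treat edge cases such as whitespace or reversed ranges differently.

```python
def expand_page_range(spec):
    """Expand a page_range string such as '0,2-4' into a list of page indices.

    Ranges are treated as inclusive on both ends, matching the example in
    the parameter table ('0,2-4' covers pages 0, 2, 3 and 4).
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"reversed range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(expand_page_range("0,2-4"))  # [0, 2, 3, 4]
```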
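When paginate is enabled, the markdown output contains separators of the form \n\n{PAGE_NUMBER}\n{48 dashes}\n\n, which makes it possible to split the result back into pages. The sketch below rests on two assumptions that this page does not settle: that each separator precedes the content of the page it names, and that the page number may or may not be wrapped in literal braces (the pattern accepts both). `split_paginated_markdown` is a hypothetical name.

```python
import re

# Blank line, page number (optionally in braces), 48 dashes, blank line.
# Assumption: the separator comes before the page it labels.
PAGE_SEPARATOR = re.compile(r"\n\n\{?(\d+)\}?\n?-{48}\n\n")


def split_paginated_markdown(markdown):
    """Return (leading_text, [(page_number, page_text), ...])."""
    parts = PAGE_SEPARATOR.split(markdown)
    leading_text = parts[0]
    pages = []
    # After the first element, parts alternate: number, text, number, text...
    for i in range(1, len(parts), 2):
        pages.append((int(parts[i]), parts[i + 1].strip()))
    return leading_text, pages


rule = "-" * 48
sample = f"\n\n0\n{rule}\n\nFirst page\n\n1\n{rule}\n\nSecond page"
leading, pages = split_paginated_markdown(sample)
print(pages)  # [(0, 'First page'), (1, 'Second page')]
```

If the real output places separators after each page instead, the same pattern still matches, but the page numbers would need to be paired with the preceding text chunk rather than the following one.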