datalab-to/marker

Convert PDF to markdown + JSON quickly with high accuracy

Capabilities

No capability data available

Cost

Community model (estimated from hardware time)

Input Parameters

Name	Type	Description	Default	Constraints
`file` *	string (uri)	Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp	`—`	—
`additional_config`	string	Advanced configuration options as JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker	`—`	—
`block_correction_prompt`	string	Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure'	`—`	—
`disable_image_extraction`	boolean	Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field	`false`	—
`disable_ocr_math`	boolean	Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX	`false`	—
`force_ocr`	boolean	Force OCR on all pages even if text is extractable. By default, Marker automatically uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output	`false`	—
`format_lines`	boolean	Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation	`false`	—
`include_metadata`	boolean	Include detailed metadata and JSON structure in the output. When enabled, returns json_data (hierarchical document structure with bounding boxes) and metadata (page stats, table of contents). When disabled (default), only returns markdown to reduce response size	`false`	—
`max_pages`	integer	Maximum number of pages to process. Cannot be specified if page_range is set - these parameters are mutually exclusive	`—`	min: 1
`mode`	string	Processing mode affecting speed and quality. 'fast': lowest latency, preserves most positional information. 'balanced': same as using use_llm. 'accurate': highest quality, slowest, preserves least positional information	`"fast"`	fast balanced accurate
`page_range`	string	Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set - these parameters are mutually exclusive	`—`	—
`page_schema`	string	Structured extraction: Provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (as a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: {"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50%	`—`	—
`paginate`	boolean	Add page separators to the output. Each page will be separated by a horizontal rule containing the page number in the format: \n\n{PAGE_NUMBER}\n{48 dashes}\n\n	`false`	—
`save_checkpoint`	boolean	Save processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows	`false`	—
`segmentation_schema`	string	JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker	`—`	—
`skip_cache`	boolean	Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results	`false`	—
`strip_existing_ocr`	boolean	Remove embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled	`false`	—
`use_llm`	boolean	Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Will increase processing time	`false`	—

file required string

Input file. Must be one of: .pdf, .doc, .docx, .ppt, .pptx, .png, .jpg, .jpeg, .webp

additional_config string

Advanced configuration options as JSON string. Options include: 'disable_links' (remove hyperlinks), 'keep_pageheader_in_output' (preserve headers), 'keep_pagefooter_in_output' (preserve footers), 'filter_blank_pages' (skip empty pages), 'drop_repeated_text' (remove duplicates), and layout/table processing thresholds. Full list at: https://documentation.datalab.to/api-reference/marker

block_correction_prompt string

Optional text prompt to guide output improvements. Use this to specify formatting preferences or extraction requirements, e.g., 'Extract all dates in YYYY-MM-DD format' or 'Keep all tables in their original structure'

disable_image_extraction boolean

Skip extracting images from the PDF. By default, images are extracted and returned as base64-encoded data in the images field

Default: false

disable_ocr_math boolean

Disable recognition of inline mathematical expressions during OCR. By default, math expressions are detected and can be formatted as LaTeX

Default: false

force_ocr boolean

Force OCR on all pages even if text is extractable. By default, Marker automatically uses OCR only when needed (e.g., scanned PDFs). Enable this if you see garbled or incorrect text in the output

Default: false

format_lines boolean

Detect and format inline mathematical expressions and text styles (bold, italic, etc.) in the output. Useful for documents with mathematical notation

Default: false

include_metadata boolean

Include detailed metadata and JSON structure in the output. When enabled, returns json_data (hierarchical document structure with bounding boxes) and metadata (page stats, table of contents). When disabled (default), only returns markdown to reduce response size

Default: false

max_pages integer

Maximum number of pages to process. Cannot be specified if page_range is set - these parameters are mutually exclusive

min: 1

mode string

Processing mode affecting speed and quality. 'fast': lowest latency, preserves most positional information. 'balanced': same as using use_llm. 'accurate': highest quality, slowest, preserves least positional information

Default: "fast"

fast balanced accurate

page_range string

Page range to parse, comma separated like 0,5-10,20. Example: '0,2-4' will process pages 0, 2, 3, and 4. Cannot be specified if max_pages is set - these parameters are mutually exclusive

page_schema string

Structured extraction: Provide a JSON Schema to extract specific fields from your document. When provided, the model extracts only the fields you define and returns them in the 'extraction_schema_json' output field (as a JSON string containing your extracted data plus citation fields showing which parts of the document were used). The 'markdown' and 'json_data' fields will still contain the full document conversion. Example: {"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}. See: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview. Increases cost by 50%

paginate boolean

Add page separators to the output. Each page will be separated by a horizontal rule containing the page number in the format: \n\n{PAGE_NUMBER}\n{48 dashes}\n\n

Default: false

save_checkpoint boolean

Save processing checkpoint for iterative refinement. Checkpoints can be used with the Marker Prompt API to apply custom rules without re-parsing the entire document. Only useful for advanced workflows

Default: false

segmentation_schema string

JSON Schema for document segmentation. Define segment names and descriptions to identify and extract different sections of the document (e.g., 'Executive Summary', 'Financial Data'). Useful for splitting long documents by section. See: https://documentation.datalab.to/api-reference/marker

skip_cache boolean

Bypass the server-side cache and force re-processing. By default, identical requests are cached to save time and cost. Enable this to get fresh results

Default: false

strip_existing_ocr boolean

Remove embedded OCR text layer from the PDF and re-run OCR from scratch. Some PDFs have low-quality embedded OCR text; this option lets you regenerate it. Ignored if force_ocr is enabled

Default: false

use_llm boolean

Use an LLM to significantly improve accuracy for tables, forms, inline math, and layout detection. This merges tables across pages, handles complex layouts, and extracts form values. Will increase processing time

Default: false

Version: 60af7e72bef7 Updated: 2/26/2026 48.5K runs