
Unlock Data with Precision Key-Value Extraction

Automatically Identify and Extract Critical Information Pairs
from docprompt import load_document_node
from docprompt.tasks.factory import AnthropicTaskProviderFactory

# Load the document from disk
document_node = load_document_node("path/to/document")

# Build a page markerization provider and convert every page to Markdown
markerize_provider = AnthropicTaskProviderFactory().get_page_markerization_provider()
document_markdown = markerize_provider.process_document_node(document_node)

print(document_markdown[5])  # Get markdown for page 5
Load documents in seconds
Ingest PDFs into your data pipeline with ease.
Convert markdown to HTML and parse the content efficiently.
Extract structured data from documents, including headings and paragraphs.
Customize document processing with flexible provider factories and task-specific providers.
Testimonial
I used the parallelized hyperparameter tuning with Prefect and Dask (incredible) to run about 350 experiments in 30 minutes, which normally would have taken 2 days
Andrew Waterman
Data Scientist, Actium
test_file.py
...
from dataclasses import dataclass

import markdown
from bs4 import BeautifulSoup

# Container for a heading and the paragraph that follows it
@dataclass
class ParagraphWithHeading:
    heading: str
    paragraph: str

document_markdown = markerize_provider.process_document_node(document_node)

# Convert markdown to HTML
html = markdown.markdown(document_markdown[5])

# Parse the HTML tree
soup = BeautifulSoup(html, 'html.parser')

paragraphs_with_headings = []

# Collect all headings and their following paragraphs
for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
    heading_text = heading.get_text(strip=True)
    paragraph = heading.find_next('p')
    if paragraph:
        paragraphs_with_headings.append(ParagraphWithHeading(heading_text, paragraph.get_text()))
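The loop above pairs each heading with the next `<p>` found anywhere later in the document, so a heading with no body text of its own picks up a paragraph from a later section. A minimal sketch of a stricter pairing, working on the Markdown text directly with only the standard library, might look like the following. It assumes ATX-style headings (`#` through `######`) and blank-line-separated paragraphs; `pair_headings_with_paragraphs` is a hypothetical helper for illustration, not part of docprompt.

```python
import re
from dataclasses import dataclass


@dataclass
class ParagraphWithHeading:
    heading: str
    paragraph: str


# ATX heading: one to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def pair_headings_with_paragraphs(markdown_text: str) -> list[ParagraphWithHeading]:
    """Pair each heading with the first paragraph inside its own section.

    Hypothetical helper: headings with no paragraph before the next
    heading are skipped instead of borrowing a later section's text.
    """
    pairs = []
    current_heading = None  # heading still waiting for its first paragraph
    buffer = []             # lines of the paragraph being collected

    def flush():
        nonlocal current_heading, buffer
        if current_heading is not None and buffer:
            pairs.append(ParagraphWithHeading(current_heading, " ".join(buffer)))
            current_heading = None  # keep only the first paragraph per heading
        buffer = []

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            current_heading = match.group(1)
        elif line.strip():
            buffer.append(line.strip())
        else:
            flush()  # a blank line ends the current paragraph
    flush()
    return pairs
```

Calling `pair_headings_with_paragraphs(document_markdown[5])` would then return one entry per heading that has body text, assuming each page's result is a plain Markdown string as the snippets on this page suggest.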

...
from langchain.text_splitter import MarkdownTextSplitter
from itertools import chain

document_markdown = markerize_provider.process_document_node(document_node)

# Split the markdown into chunks of about 100 characters, with no overlap
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)

docs = markdown_splitter.create_documents(list(document_markdown.values()))
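To make `chunk_size` and `chunk_overlap=0` concrete, here is a simplified stand-in for a text splitter, written with the standard library only. It is not LangChain's algorithm; `split_into_chunks` is a hypothetical helper that greedily packs blank-line-separated blocks into chunks, and the `pages` dictionary is made-up sample data keyed by page number.

```python
from itertools import chain


def split_into_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks.

    Chunks stay within chunk_size characters where possible; a single
    block longer than chunk_size is kept whole in this sketch.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= chunk_size:
            current = candidate      # block still fits in the open chunk
        else:
            if current:
                chunks.append(current)
            current = block          # start a new chunk, no overlap
    if current:
        chunks.append(current)
    return chunks


# Hypothetical per-page Markdown, keyed by page number.
pages = {
    1: "# Title\n\nShort intro.",
    2: "## Details\n\n" + "x" * 90 + "\n\nClosing note.",
}

# Flatten the per-page chunk lists into one list of chunks.
chunks = list(chain.from_iterable(split_into_chunks(p) for p in pages.values()))
```

Page 1 fits into a single chunk, while page 2 is split because its heading and long block together exceed the limit.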


Ready, Set, Flow
