
Monday, 16 December 2024

MarkItDown

A Python tool for converting PDF files and Office documents to Markdown.


The MarkItDown library is a utility for converting various file types to Markdown, for example for indexing or text analysis.

It presently supports:

  • PDF (.pdf)
  • PowerPoint (.pptx)
  • Word (.docx)
  • Excel (.xlsx)
  • Images (EXIF metadata, and OCR)
  • Audio (EXIF metadata, and speech transcription)
  • HTML (special handling of Wikipedia, etc.)
  • Various other text-based formats (csv, json, xml, etc.)
  • ZIP (Iterates over contents and converts each file)
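As a rough illustration of how the list above might be used, the following sketch filters a set of paths down to those whose extension appears in the list. The names `SUPPORTED_EXTENSIONS`, `is_supported`, and `select_supported` are hypothetical helpers, not part of MarkItDown, and the set is an assumption built only from the extensions named above (with `.html` and `.zip` assumed for the HTML and ZIP entries). Image and audio extensions are left out because the list does not name them.

```python
from pathlib import Path

# Assumption: extensions taken from the list above. Image and audio
# extensions are omitted because the post does not enumerate them.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".pptx", ".docx", ".xlsx",
    ".html", ".csv", ".json", ".xml", ".zip",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the set above."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def select_supported(paths: list[str]) -> list[str]:
    """Keep only the paths whose extension is in the set, preserving order."""
    return [p for p in paths if is_supported(p)]
```

The comparison is case-insensitive, so `report.PDF` and `report.pdf` are treated alike. The tool itself may accept more formats than this set covers.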

Installation

You can install markitdown using pip:

pip install markitdown

or from source, by running this in a local clone of the repository:

pip install -e .

Usage

To use this as a command-line utility, install it and then run it like this:

markitdown path-to-file.pdf

This will output Markdown to standard output. You can save it like this:

markitdown path-to-file.pdf > document.md
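The same redirect can be driven from a script. The sketch below is one possible wrapper, assuming the `markitdown` command is installed and on the PATH. `build_command` and `convert_to_file` are hypothetical helper names, and the `run` parameter exists only so the subprocess call can be swapped out.

```python
import subprocess
from pathlib import Path


def build_command(source: str) -> list[str]:
    """Build the argument list for `markitdown <source>`."""
    return ["markitdown", source]


def convert_to_file(source: str, target: str, run=subprocess.run) -> Path:
    """Run the CLI on `source` and write its standard output to `target`."""
    completed = run(
        build_command(source),
        capture_output=True,  # collect stdout instead of printing it
        text=True,
        encoding="utf-8",
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    out = Path(target)
    out.write_text(completed.stdout, encoding="utf-8")
    return out
```

Calling `convert_to_file("path-to-file.pdf", "document.md")` should then have the same effect as the shell redirect shown above.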

You can pipe content to standard input by omitting the argument:

cat path-to-file.pdf | markitdown
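The package can also be called from Python rather than from the shell. The sketch below relies on the `MarkItDown` class, its `convert` method, and the `text_content` attribute of the result, as documented in the upstream project README. This post does not show them, so treat them as an assumption and check the version you have installed. `convert_to_markdown` is a hypothetical wrapper name.

```python
def convert_to_markdown(path: str, converter=None) -> str:
    """Convert one file to Markdown text.

    `converter` is any object with a `convert(path)` method whose result
    has a `text_content` attribute. By default a MarkItDown instance is
    created (assumed API, taken from the upstream README).
    """
    if converter is None:
        # Imported lazily so this module loads without markitdown installed.
        from markitdown import MarkItDown
        converter = MarkItDown()
    return converter.convert(path).text_content
```

With the package installed, `print(convert_to_markdown("path-to-file.pdf"))` should print the same Markdown the command-line tool writes to standard output.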

Running Tests

To run the tests, install hatch using pip or one of the other methods described in the hatch installation documentation.

pip install hatch
hatch shell
hatch test

Running Pre-commit Checks

Please run the pre-commit checks before submitting a PR.

pre-commit run --all-files

from https://github.com/microsoft/markitdown