Python tool for converting PDF files and Office documents to Markdown
The MarkItDown library is a utility for converting various file types to Markdown, for example for indexing or text analysis.
It presently supports:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
- ZIP (iterates over contents and converts each file)
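To make the list above easier to act on in a script, here is a minimal sketch that checks whether a file's extension is one of those named explicitly in the list. The names LISTED_EXTENSIONS and listed_format are hypothetical helpers invented for this example; this is not how MarkItDown itself detects formats, and image, audio and HTML extensions are left out because the list does not spell them out.

```python
from pathlib import Path

# Extensions named explicitly in the list above (an assumption of this
# sketch: MarkItDown's own detection may work differently).
LISTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".pptx": "PowerPoint",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".csv": "text-based",
    ".json": "text-based",
    ".xml": "text-based",
    ".zip": "ZIP",
}


def listed_format(path):
    """Return the list entry matching the file's extension, or None."""
    return LISTED_EXTENSIONS.get(Path(path).suffix.lower())
```

A caller could use this to skip files whose extension is not in the list before handing them to the converter.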
You can install markitdown using pip:
pip install markitdown
or from source, by running this in the root of the repository:
pip install -e .
Usage
To use this as a command-line utility, install it and then run it like this:
markitdown path-to-file.pdf
This will output Markdown to standard output. You can save it like this:
markitdown path-to-file.pdf > document.md
You can pipe content to standard input by omitting the argument:
cat path-to-file.pdf | markitdown
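The same command-line calls can be driven from a Python script. The following is a minimal sketch, assuming the markitdown command is installed and on PATH and that its output is UTF-8 text. The helper names build_command, convert_file and convert_bytes are hypothetical, made up for this example; only the command invocations themselves come from the usage shown above.

```python
import subprocess
from pathlib import Path


def build_command(path=None):
    """Return the argv list for the CLI; no path means 'read from stdin'."""
    cmd = ["markitdown"]
    if path is not None:
        cmd.append(str(path))
    return cmd


def convert_file(path):
    """Run `markitdown <path>` and return the Markdown written to stdout."""
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    return result.stdout.decode("utf-8")  # assumption: output is UTF-8


def convert_bytes(data):
    """Pipe raw bytes to `markitdown` on stdin, like `cat file | markitdown`."""
    result = subprocess.run(
        build_command(), input=data, capture_output=True, check=True
    )
    return result.stdout.decode("utf-8")  # assumption: output is UTF-8


if __name__ == "__main__":
    source = Path("path-to-file.pdf")
    if source.exists():
        # Equivalent to: markitdown path-to-file.pdf > document.md
        Path("document.md").write_text(convert_file(source), encoding="utf-8")
```

Because check=True is set, a non-zero exit status from the command raises subprocess.CalledProcessError instead of silently returning empty output.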
To run tests, install hatch using pip or one of the other methods described in the Hatch installation documentation:
pip install hatch
hatch shell
hatch test
Please run the pre-commit checks before submitting a PR.
pre-commit run --all-files
from https://github.com/microsoft/markitdown