Thursday, 17 October 2024

Surya is a document OCR toolkit

OCR, layout analysis, reading order, table recognition in 90+ languages

www.datalab.to

Surya is a document OCR toolkit that does:

OCR in 90+ languages that benchmarks favorably vs cloud services
Line-level text detection in any language
Layout analysis (detection of tables, images, headers, etc.)
Reading order detection
Table recognition (detecting rows/columns)

It works on a range of documents (see usage and benchmarks for more details).

[Example images: Detection, OCR, Layout, Reading Order, Table Recognition]

Surya is named for the Hindu sun god, who has universal vision.

Community

Discord is where we discuss future development.

Examples

Name	Detection	OCR	Layout	Order	Table Rec
Japanese	Image	Image	Image	Image	Image
Chinese	Image	Image	Image	Image
Hindi	Image	Image	Image	Image
Arabic	Image	Image	Image	Image
Chinese + Hindi	Image	Image	Image	Image
Presentation	Image	Image	Image	Image	Image
Scientific Paper	Image	Image	Image	Image	Image
Scanned Document	Image	Image	Image	Image	Image
New York Times	Image	Image	Image	Image
Scanned Form	Image	Image	Image	Image	Image
Textbook	Image	Image	Image	Image

Hosted API

A hosted API for all surya models is available from Datalab (www.datalab.to):

Works with PDF, images, word docs, and powerpoints
Consistent speed, with no latency spikes
High reliability and uptime

Commercial usage

I want surya to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Installation

You'll need Python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.

Install with:

pip install surya-ocr

Model weights will automatically download the first time you run surya.

Usage

Inspect the settings in surya/settings.py. You can override any settings with environment variables.
Your torch device is detected automatically, but you can override it, for example with TORCH_DEVICE=cuda.
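
As a rough sketch of overriding settings through environment variables from Python, the snippet below builds an environment for a subprocess. The helper name surya_env is hypothetical and not part of surya; only the variable name TORCH_DEVICE comes from the text above.

```python
import os

def surya_env(**overrides):
    """Return a copy of the current environment with extra variables set.

    Hypothetical helper: keyword names are used verbatim as environment
    variable names, and values are converted to strings.
    """
    env = dict(os.environ)  # copy, so the parent process is left untouched
    for name, value in overrides.items():
        env[name] = str(value)
    return env

env = surya_env(TORCH_DEVICE="cuda")
print(env["TORCH_DEVICE"])
```

To apply it, pass the result as the env argument of subprocess.run when launching one of the surya commands.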

Interactive App

I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:

pip install streamlit
surya_gui

OCR (text recognition)

This command will write out a json file with the detected text and bboxes:

surya_ocr DATA_PATH

DATA_PATH can be an image, pdf, or folder of images/pdfs
--langs is an optional (but recommended) argument that specifies the language(s) to use for OCR. You can comma separate multiple languages. Use the language name or two-letter ISO code from here. Surya supports the 90+ languages found in surya/languages.py.
--lang_file lets you use different languages for different PDFs/images by specifying them in a file (optional). The format is a JSON dict whose keys are filenames and whose values are lists of languages, like {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}.
--images will save images of the pages and detected text lines (optional)
--results_dir specifies the directory to save results to instead of the default
--max specifies the maximum number of pages to process if you don't want to process everything
--start_page specifies the page number to start processing from
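
The options above can be sketched in Python as follows. This is an illustration, not part of surya: write_lang_file and ocr_command are hypothetical helpers, the file name languages.json is made up, and the snippet assumes each flag takes its value as the next argument (for example --langs en,hi). The command is only assembled, not executed.

```python
import json
import tempfile
from pathlib import Path

def write_lang_file(mapping, path):
    """Write {filename: [languages]} as JSON, the shape described for --lang_file."""
    for name, langs in mapping.items():
        if not isinstance(langs, list) or not langs:
            raise ValueError(f"{name}: expected a non-empty list of languages")
    path = Path(path)
    path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return path

def ocr_command(data_path, lang_file=None, langs=None, max_pages=None):
    """Assemble an argument list for surya_ocr (assumed flag/value layout)."""
    cmd = ["surya_ocr", str(data_path)]
    if lang_file is not None:
        cmd += ["--lang_file", str(lang_file)]
    if langs:
        cmd += ["--langs", ",".join(langs)]  # comma-separated, as described above
    if max_pages is not None:
        cmd += ["--max", str(max_pages)]
    return cmd

lang_path = Path(tempfile.mkdtemp()) / "languages.json"
write_lang_file({"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}, lang_path)
cmd = ocr_command("docs/", lang_file=lang_path, max_pages=10)
print(cmd)
```

The resulting list can be handed to subprocess.run once surya is installed.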

The results.json file contains a JSON dictionary whose keys are the input filenames without extensions. Each value is a list of dictionaries, one per page of the input document. Each page dictionary contains:

text_lines - the detected text and bounding boxes for each line
- text - the text in the line
- confidence - the confidence of the model in the detected text (0-1)
- polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
- bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
languages - the languages specified for the page
page - the page number in the file
image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
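
A minimal sketch of reading this structure, assuming the field names listed above. The sample dictionary is hand-written for illustration (the document name, text, coordinates, and confidences are made up), and confident_lines and bbox_inside are hypothetical helpers. With real output you would load the dictionary with json.load from results.json instead.

```python
# Hand-written sample mirroring the fields described above; values are made up.
sample = {
    "invoice": [
        {
            "text_lines": [
                {"text": "Total due", "confidence": 0.98,
                 "polygon": [[10, 10], [120, 10], [120, 30], [10, 30]],
                 "bbox": [10, 10, 120, 30]},
                {"text": "4Z.00", "confidence": 0.41,
                 "polygon": [[130, 10], [180, 10], [180, 30], [130, 30]],
                 "bbox": [130, 10, 180, 30]},
            ],
            "languages": ["en"],
            "page": 1,
            "image_bbox": [0, 0, 200, 100],
        }
    ]
}

def confident_lines(results, min_confidence=0.5):
    """Yield (document, page, text, bbox) for lines at or above the threshold."""
    for doc_name, pages in results.items():
        for page in pages:
            for line in page["text_lines"]:
                if line["confidence"] >= min_confidence:
                    yield doc_name, page["page"], line["text"], line["bbox"]

def bbox_inside(bbox, outer):
    """True if an (x1, y1, x2, y2) box lies within another box."""
    x1, y1, x2, y2 = bbox
    ox1, oy1, ox2, oy2 = outer
    return ox1 <= x1 and oy1 <= y1 and x2 <= ox2 and y2 <= oy2

kept = list(confident_lines(sample))
print(kept)
```

Filtering on confidence is one simple way to drop likely misreads before further processing; the 0.5 threshold is arbitrary.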

Performance tips

Setting the RECOGNITION_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item uses about 40MB of VRAM, so very high batch sizes are possible. The default batch size is 512, which uses about 20GB of VRAM. Tuning the batch size may also help on CPU, depending on your core count; the default CPU batch size is 32.
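
The arithmetic behind these figures can be sketched as below. It assumes 1GB = 1024MB and that the whole budget is available for batch items, ignoring the model weights and other overhead, so treat the result as an upper bound. The function names are illustrative only.

```python
MB_PER_BATCH_ITEM = 40  # per-item VRAM figure quoted in the text above

def vram_for_batch(batch_size):
    """Estimated VRAM in GB for a given recognition batch size."""
    return batch_size * MB_PER_BATCH_ITEM / 1024

def batch_for_vram(budget_gb):
    """Largest batch size whose estimated usage fits within the budget."""
    return int(budget_gb * 1024 // MB_PER_BATCH_ITEM)

print(vram_for_batch(512))  # 20.0, matching the "about 20GB" default
print(batch_for_vram(8))    # 204
```

For an 8GB card this suggests starting near RECOGNITION_BATCH_SIZE=204 and lowering it if you run out of memory.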

from https://github.com/VikParuchuri/surya