You are viewing archived messages.
Go here to search the history.

Marek Rogalski 2025-03-19 15:32:14

I'm getting feedback about the state of the game using basic OCR now. Unfortunately the OCR that I'm using is optimized towards "natural" text - so it doesn't handle game UIs too well.

🎥 OCR.mp4

Tom Larkworthy 2025-03-19 16:11:27

the deep learning models are now state of the art for OCR imho. they can do natural scenes as well

Marek Rogalski 2025-03-19 16:13:51

I've spent some time digging through huggingface but the OCR models I've found tended to be 1GB+

The also have a ton of "image-to-text" models but unfortunately it's not the same as "OCR".

I kind of wish there was a "ocr.cpp" repo somewhere - just like "llama.cpp" or "whisper.cpp"...

Tom Larkworthy 2025-03-19 16:18:07

yeah I would be looking for multi-modal LLM, but definitely they are gonna be massive for local use so that would be a good reason to use classical OCR.

That said, Skyrim 5.6 GB so gamers are quite tolerant of large downloads.

Tom Larkworthy 2025-03-19 16:20:04

"Here's an example of how to run llama.cpp's built-in HTTP server. This example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's recently-added support for image inputs." so its seems like multi modal is supported by llama.cpp now (says llamafile's README)

Marek Rogalski 2025-03-19 16:22:56

Desktop capture + Multimodal LLMs + Fake input sounds like a match made in heaven

Tom Larkworthy 2025-03-19 16:25:37

I know one local doing something in this area github.com/e2b-dev/desktop

Marek Rogalski 2025-03-19 16:25:46

BTW Tesseract OCR that I'm using clearly hasn't been updated in quite a long time. It's docs praise the new LSTM-based engine. I wonder how a modern convnet hierarchy + attention architecture would work...

Tom Larkworthy 2025-03-19 16:29:48

At work we switched out document OCR from classical to LLM and got better results. docs are literally the ideal use case for classical OCR but still LLMs seem to outperform coz they "get" the task and the words. For your use case the OCR is not on docs so I imagine the delta is even better.