In brief
How do you walk gigabytes of PDFs and DOCX on Google Drive, extract paper titles and abstracts, and survive Gemini API quotas? A Habr write-up chains Google Apps Script, built-in Drive OCR, time triggers, LockService, and API key rotation — no paid document parsers.
What happened
The author needed to catalog a large scientific archive: exact title, short summary, and whether a specific researcher co-authored each paper. A naive Apps Script hit three walls at once.
- Six-minute execution limit: OCR plus an LLM call takes 15–40 seconds per heavy PDF, so the run dies around file 20.
- Binary formats: Apps Script (GAS) cannot read PDF or DOCX natively, and paid parsers are expensive.
- Gemini free-tier quotas: requests quickly start returning HTTP 429.
The fix stacks several tricks. Hidden Google Drive OCR: via the Drive API, copy the PDF/DOCX to a temporary Google Doc with ocr: true, which uses the same engine Drive applies when you open a scan as a Google Doc by hand. Read the text with DocumentApp, and delete the temp file in a finally block, or Drive fills up with junk.
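The OCR step might look roughly like the sketch below. It assumes the Drive advanced service is enabled at API version v2 (where Files.copy accepts ocr and ocrLanguage); the function name extractTextViaOcr, the temp title, and the services parameter are hypothetical, and services exists only so the function can be tried with stubs outside Apps Script.

```javascript
/**
 * Converts a PDF/DOCX on Drive to plain text through a temporary Google Doc.
 * `services` defaults to the Apps Script globals; passing stubs is optional.
 */
function extractTextViaOcr(fileId, language = 'ru', services = { Drive, DocumentApp }) {
  let tempId = null;
  try {
    // Drive API v2 (advanced service): copy the file and ask Drive to OCR it.
    const copy = services.Drive.Files.copy(
      { title: 'tmp-ocr-' + fileId }, // hypothetical temp name
      fileId,
      { ocr: true, ocrLanguage: language }
    );
    tempId = copy.id;
    // The temporary Google Doc now holds the recognized text.
    return services.DocumentApp.openById(tempId).getBody().getText();
  } finally {
    // Always remove the temporary Doc, even if OCR or reading failed.
    if (tempId) services.Drive.Files.remove(tempId);
  }
}
```

Note that Drive.Files.remove deletes the copy permanently rather than moving it to the trash, which suits throwaway files but should never be pointed at the source document.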
Beat the 6-minute cap by using a Google Sheet as a simple database: record each processed filename there, and let a one-minute time trigger restart the script. Each new run skips the finished rows and picks up where the last one stopped (at file 16, say). LockService prevents races: while one run spends over a minute OCRing a PDF, the next trigger must not start on the same file and duplicate rows.
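A minimal sketch of such a resumable run follows. All names (runBatch, doneNames, processFile, budgetMs) are hypothetical, the five-minute budget is an assumed safety margin, and for simplicity the lock is held for the whole run rather than per file as the article does.

```javascript
/**
 * Processes files not yet recorded, stopping before the time budget runs out.
 * Returns the names handled in this run.
 *
 * doneNames   - Set of filenames already written to the sheet
 * processFile - callback that does OCR + LLM and appends a row
 * lock        - object with tryLock(ms) / releaseLock(),
 *               e.g. LockService.getScriptLock()
 * now         - clock function, defaults to Date.now
 */
function runBatch(files, doneNames, processFile, lock,
                  budgetMs = 5 * 60 * 1000, now = Date.now) {
  // If a previous trigger run is still working, skip this run entirely.
  if (!lock.tryLock(1000)) return [];
  const started = now();
  const handled = [];
  try {
    for (const file of files) {
      if (doneNames.has(file.name)) continue; // finished in an earlier run
      if (now() - started > budgetMs) break;  // leave headroom before the cap
      processFile(file);
      doneNames.add(file.name);
      handled.push(file.name);
    }
  } finally {
    lock.releaseLock();
  }
  return handled;
}
```

In Apps Script, doneNames would be built from the sheet's filename column at the start of each run, and processFile would append the new row so that progress survives a timeout.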
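Gemini key rotation: keep an array of AI Studio keys and switch to the next one on a 429; once you have cycled through the whole pool, sleep 30 seconds so the requests-per-minute (RPM) limit can reset. Ask the LLM for JSON (responseMimeType: application/json) to get the title and summary in one call, with no markdown fences to strip.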
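The rotation and the JSON request could be sketched as below. The function names, the prompt wording, the maxPasses cut-off, and the services parameter are assumptions for illustration; the model id is left to the caller because the summary only says "Flash Lite". The rotation restarts from the first key on every call, which is a simplification.

```javascript
/**
 * Calls requestFn(key) and moves to the next key on HTTP 429.
 * After a full pass over the pool it pauses so per-minute limits can reset.
 */
function callWithKeyRotation(keys, requestFn, sleepFn, pauseMs = 30000, maxPasses = 3) {
  for (let pass = 0; pass < maxPasses; pass++) {
    for (const key of keys) {
      const response = requestFn(key);
      if (response.getResponseCode() !== 429) return response;
    }
    // Every key returned 429: wait before the next pass (not after the last).
    if (pass < maxPasses - 1) sleepFn(pauseMs);
  }
  throw new Error('All API keys are still rate-limited');
}

/** Asks Gemini for {title, summary} as raw JSON, so no fences need stripping. */
function extractFields(text, keys, model, services = { UrlFetchApp, Utilities }) {
  const payload = JSON.stringify({
    contents: [{ parts: [{
      text: 'Return JSON with keys "title" and "summary" for this paper:\n' + text,
    }] }],
    generationConfig: { responseMimeType: 'application/json' },
  });
  const response = callWithKeyRotation(
    keys,
    (key) => services.UrlFetchApp.fetch(
      'https://generativelanguage.googleapis.com/v1beta/models/' + model +
        ':generateContent?key=' + key,
      { method: 'post', contentType: 'application/json',
        payload: payload, muteHttpExceptions: true } // 429 is returned, not thrown
    ),
    (ms) => services.Utilities.sleep(ms)
  );
  if (response.getResponseCode() !== 200) {
    throw new Error('Gemini request failed: HTTP ' + response.getResponseCode());
  }
  const body = JSON.parse(response.getContentText());
  return JSON.parse(body.candidates[0].content.parts[0].text);
}
```

Setting muteHttpExceptions matters here: without it UrlFetchApp throws on a 429 and the status code can no longer drive the rotation.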
Why it matters
The pattern shows that Apps Script can run long background pipelines when you chunk the work and guard shared state. For hundreds to thousands of Drive files this is cheaper than a dedicated OCR server, though it will not scale to millions.
Trade-offs: dependence on Google quotas and the need for temp-file hygiene. The combination of free OCR, Flash Lite, and a key pool can process on the order of 1,500 docs/day on three keys in roughly 2 hours of trigger time.
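As a back-of-the-envelope check on those figures, the quoted numbers can be divided out. The variable names are illustrative and the inputs are the approximate values from the summary, so the results are rough.

```javascript
const docsPerDay = 1500;  // quoted throughput
const keyCount = 3;       // quoted number of API keys
const triggerHours = 2;   // quoted, approximate

// Load carried by each key per day.
const docsPerKey = docsPerDay / keyCount;

// Average wall-clock time per document across the whole run.
const avgSecondsPerDoc = (triggerHours * 3600) / docsPerDay;

console.log(docsPerKey, avgSecondsPerDoc);
```

That works out to about 500 documents per key and a little under 5 seconds per document on average, well below the 15–40 seconds quoted for heavy PDFs. One reading is that most files in the archive are much lighter than the worst case; that is an inference, not a claim from the article.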
In practice
- Enable Drive API in the Apps Script editor, not only DocumentApp.
- OCR with Drive.Files.copy, ocr: true, ocrLanguage: "ru"; delete temps in try/finally.
- Track progress in Google Sheets; hash/skip processed names before the loop.
- Time-driven triggers; delete triggers when the catalog finishes.
- LockService.getScriptLock() per file: no parallel double-processing.
- GEMINI_API_KEYS pool, rotate on 429, Utilities.sleep(30000) when all keys hit RPM.
- responseMimeType: "application/json" for structured fields, with no ```json fence parsing.
- Non-text formats (.pptx, .xlsx) → placeholder rows, zero tokens.
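Two of the items above, trigger cleanup and placeholder rows for non-text formats, can be sketched as follows. The helper names and the scriptApp parameter are hypothetical, and the extension list holds only the two examples named in the summary, so a real archive would likely need more.

```javascript
/** Creates a one-minute trigger for `handlerName` unless one already exists. */
function ensureTrigger(handlerName, scriptApp = ScriptApp) {
  const exists = scriptApp.getProjectTriggers()
    .some((t) => t.getHandlerFunction() === handlerName);
  if (!exists) {
    scriptApp.newTrigger(handlerName).timeBased().everyMinutes(1).create();
  }
}

/** Deletes every trigger bound to `handlerName`; returns how many were removed. */
function removeTriggers(handlerName, scriptApp = ScriptApp) {
  let removed = 0;
  for (const t of scriptApp.getProjectTriggers()) {
    if (t.getHandlerFunction() === handlerName) {
      scriptApp.deleteTrigger(t);
      removed++;
    }
  }
  return removed;
}

/** True for formats that get a placeholder row instead of OCR + LLM. */
function isPlaceholderFormat(filename) {
  return /\.(pptx|xlsx)$/i.test(filename); // assumed, incomplete list
}
```

Calling removeTriggers once no unprocessed files remain keeps the script from waking up every minute with nothing to do.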
Takeaway
The Habr article describes a practical, autonomous Drive document registry: OCR without third-party services, LLM field extraction, and resilience to timeouts and quotas. If your archive lives in Google's cloud, adapt the columns and prompts; code fragments are in the original.