How to Extract Text from a Scanned PDF — Free, No Upload

Scanned PDFs are essentially images — the text looks readable to you but a computer sees only pixels. That's where Optical Character Recognition (OCR) comes in. OCR software analyses the pixel patterns and converts them into real, selectable, searchable text.

Why You Shouldn't Upload Scanned PDFs to Random Sites

Scanned documents are often the most sensitive files you own: contracts, medical records, passports, bank statements. When you upload them to a cloud OCR tool, the file sits on a server you don't control. Some services retain uploaded files for "quality improvement," and some keep them indefinitely.

ConvertPDF's OCR tool runs entirely in your browser using Tesseract.js, a JavaScript and WebAssembly port of Tesseract, one of the most widely used open-source OCR engines. Your file never leaves your device.

Step-by-Step: How to Extract Text from a Scanned PDF

  1. Open the OCR tool at convertpdf.pages.dev/pages/ocrtool.html.
  2. Add your file: drag and drop a JPEG, PNG, WebP, BMP, TIFF, or PDF file onto the upload area. The file is read locally in your browser, not sent to a server.
  3. Select your language — choose from 12 supported languages including English, French, German, Spanish, Chinese, Arabic, and Hindi.
  4. Click Extract Text: Tesseract.js loads its language model and processes the image. A single page typically takes 5–15 seconds, depending on your device.
  5. Copy or Download — copy the extracted text to your clipboard, or click Download .txt to save it as a plain text file.

Tips for Best OCR Accuracy

  • Use high-resolution images — 300 DPI or higher gives significantly better results than screen-resolution scans (72–96 DPI).
  • Match the language: selecting the wrong language is one of the most common causes of garbled output. If your document mixes two languages, run OCR twice, once with each language selected.
  • Straighten the image first — Tesseract handles mild skew but struggles with pages scanned at an angle greater than 10°. Use the Rotate PDF tool to straighten before OCR.
  • Higher contrast = better results: if the original document is faded, increase the contrast of the scanned image in your image editor before running OCR.
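
To check whether a scan meets the 300 DPI guideline in the tips above, divide its pixel width by the physical page width in inches. The TypeScript sketch below is illustrative only and is not part of the OCR tool; the function names and the US Letter page width (8.5 in) are assumptions chosen for the example.

```typescript
// Hypothetical helpers for illustration; not part of ConvertPDF or Tesseract.js.

/** Effective DPI of a scan, from its pixel width and the paper width in inches. */
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  if (pixelWidth <= 0 || pageWidthInches <= 0) {
    throw new RangeError("pixelWidth and pageWidthInches must be positive");
  }
  return Math.round(pixelWidth / pageWidthInches);
}

/** True when the scan reaches the target resolution (300 DPI by default). */
function meetsOcrResolution(
  pixelWidth: number,
  pageWidthInches: number,
  targetDpi: number = 300
): boolean {
  return effectiveDpi(pixelWidth, pageWidthInches) >= targetDpi;
}

// A US Letter page (8.5 in wide) scanned 2550 px wide works out to 300 DPI.
console.log(effectiveDpi(2550, 8.5));      // 300
// The same page at 816 px wide is only 96 DPI (816 / 8.5).
console.log(meetsOcrResolution(816, 8.5)); // false
```

If the check fails, rescanning at a higher resolution is usually more effective than upscaling the existing image, since upscaling adds pixels but no new detail.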

What Formats Does the OCR Tool Accept?

The tool accepts JPEG, PNG, WebP, BMP, and TIFF image files, as well as PDF files. For PDFs, the tool renders each page as an image and processes the pages sequentially. Multi-page PDFs show a progress indicator with the page count.
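
As a rough illustration of the format check described above, the TypeScript sketch below decides by file extension whether a file is accepted and whether it needs page rendering first. The extension list is inferred from the format names and the function names are hypothetical; the tool's actual check is not documented here and may inspect the file contents instead.

```typescript
// Assumed extension list, inferred from the formats named above.
const ACCEPTED_EXTENSIONS: ReadonlyArray<string> = [
  "jpg", "jpeg", "png", "webp", "bmp", "tif", "tiff", "pdf",
];

/** Lower-cased extension without the dot, or "" when there is none. */
function fileExtension(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  // dot <= 0 covers both "no dot" and dot-files such as ".hidden".
  return dot > 0 ? fileName.slice(dot + 1).toLowerCase() : "";
}

/** True when the extension is one of the accepted image or PDF types. */
function isAcceptedFile(fileName: string): boolean {
  return ACCEPTED_EXTENSIONS.indexOf(fileExtension(fileName)) !== -1;
}

/** PDFs are rendered page by page to images before OCR; images are used directly. */
function needsPageRendering(fileName: string): boolean {
  return fileExtension(fileName) === "pdf";
}

console.log(isAcceptedFile("passport-scan.PDF")); // true
console.log(isAcceptedFile("notes.docx"));        // false
```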

Supported Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Chinese (Simplified), Japanese, Arabic, and Hindi.
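
Tesseract identifies each language by a short code for its trained data. The TypeScript sketch below maps the twelve languages listed above to their standard Tesseract codes. The function name is hypothetical. The Tesseract engine also accepts several codes joined with "+", but whether this tool exposes that option is an assumption, not something stated here.

```typescript
// Standard Tesseract language codes for the twelve languages listed above.
const TESSERACT_LANG_CODES: { [language: string]: string } = {
  "English": "eng",
  "French": "fra",
  "German": "deu",
  "Spanish": "spa",
  "Italian": "ita",
  "Portuguese": "por",
  "Dutch": "nld",
  "Russian": "rus",
  "Chinese (Simplified)": "chi_sim",
  "Japanese": "jpn",
  "Arabic": "ara",
  "Hindi": "hin",
};

/** Hypothetical helper: turn display names into a Tesseract language string. */
function toLangString(languages: string[]): string {
  const codes: string[] = [];
  for (let i = 0; i < languages.length; i++) {
    const name = languages[i];
    // hasOwnProperty avoids matching inherited keys such as "toString".
    if (!Object.prototype.hasOwnProperty.call(TESSERACT_LANG_CODES, name)) {
      throw new Error(`Unsupported language: ${name}`);
    }
    codes.push(TESSERACT_LANG_CODES[name]);
  }
  return codes.join("+");
}

console.log(toLangString(["German"]));            // "deu"
console.log(toLangString(["English", "French"])); // "eng+fra"
```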

Frequently Asked Questions

Is my file uploaded to a server?

No. Tesseract.js runs directly in your browser using WebAssembly. Your file never leaves your device.

Why is my OCR output garbled?

The most likely causes are the wrong language being selected, a low-resolution scan (use 300 DPI or higher), or too much noise or staining in the image. Confirm the language setting, then try improving the image quality.

Can I OCR a multi-page PDF?

Yes. The tool processes each page in sequence and combines all extracted text into one output. Large PDFs may take a minute or two depending on your device.
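
The TypeScript sketch below shows one way per-page results could be combined into a single output, plus a rough time estimate based on the 5 to 15 seconds per page quoted in the steps. The page separator format and both function names are hypothetical; the tool's real output layout may differ.

```typescript
/** Join per-page OCR text into one output, labelling each page (hypothetical format). */
function combinePageTexts(pageTexts: string[]): string {
  const parts: string[] = [];
  for (let i = 0; i < pageTexts.length; i++) {
    const label = `--- Page ${i + 1} of ${pageTexts.length} ---`;
    parts.push(`${label}\n${pageTexts[i].trim()}`);
  }
  return parts.join("\n\n");
}

/** Rough processing time in seconds, assuming 5 to 15 seconds per page. */
function estimateSeconds(pageCount: number): { min: number; max: number } {
  return { min: pageCount * 5, max: pageCount * 15 };
}

console.log(combinePageTexts(["First page text ", " Second page text"]));
console.log(estimateSeconds(10)); // { min: 50, max: 150 }
```

By this estimate a 10-page PDF needs roughly 50 to 150 seconds, which matches the "minute or two" figure above. Actual times depend on the device and on how dense each page is.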

What if the text isn't perfectly accurate?

OCR is never perfectly accurate, especially on handwriting or stylised fonts. Review the output and correct any errors before use.
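
Tesseract reports a confidence score for what it recognises, which can help focus a manual review. The TypeScript sketch below assumes per-word scores from 0 to 100 are available. The OcrWord shape, the function name, and the threshold of 80 are hypothetical, and this tool may not expose confidence values at all.

```typescript
// Assumed minimal shape: a recognised word and its confidence score (0 to 100).
interface OcrWord {
  text: string;
  confidence: number;
}

/** Return the words whose confidence falls below the threshold, in reading order. */
function wordsToReview(words: OcrWord[], threshold: number = 80): string[] {
  return words
    .filter((word) => word.confidence < threshold)
    .map((word) => word.text);
}

const sampleWords: OcrWord[] = [
  { text: "Invoice", confidence: 96 },
  { text: "t0tal", confidence: 41 },
  { text: "due", confidence: 88 },
];
console.log(wordsToReview(sampleWords)); // [ 't0tal' ]
```

A low score does not prove a word is wrong, and a high score does not prove it is right, so a check like this complements reading the output rather than replacing it.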
