Extract images or text from PDF

Sometimes you end up in a situation where you have a PDF file that contains text and images, and you want to use them in another application. People often think that a PDF is set in stone, but that is not true. It is actually quite easy to extract content from a PDF document. You have several options: there are command-line tools available, and the document viewer Okular can also copy text and images.

The command-line tools come in the poppler-utils package. In Ubuntu it is installed by default. If it is not installed on your system, run

sudo apt-get install poppler-utils

This package contains several command-line tools, but let's focus on two of them: pdfimages extracts all images from a PDF file, and pdftotext converts the content of a PDF file to text. The command

pdfimages my_pdffile.pdf ./imagename

will extract all images from the document my_pdffile.pdf, save them in your current directory, and name them in sequence (“imagename-000.xxx”, “imagename-001.xxx”, etc.). Note that by default images are saved in PPM (non-monochrome) or PBM (monochrome) format. With the -j option, images stored in DCT format are saved as JPEG files instead. Either way, you can batch convert the images later with ImageMagick.
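If you have a whole directory of PDF files, the same command can be wrapped in a small loop. The following is only a sketch: the function names extract_all_images and convert_to_png and the default ./images directory are made up for this example and are not part of poppler-utils or ImageMagick. It also assumes that ImageMagick provides the convert command (newer ImageMagick releases call it magick).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not part of any package.

# Extract the images from every PDF in a directory.
# Each PDF's images get that PDF's name as prefix (report.pdf -> report-000.jpg ...).
extract_all_images() {
    src_dir=${1:-.}          # directory holding the PDF files
    out_dir=${2:-./images}   # assumed default output directory
    mkdir -p "$out_dir" || return 1

    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # the glob matched nothing
        name=$(basename "$pdf" .pdf)     # strip directory and .pdf suffix
        # -j saves DCT images as .jpg; everything else becomes .ppm or .pbm
        pdfimages -j "$pdf" "$out_dir/$name"
    done
}

# Convert the PPM/PBM files in a directory to PNG with ImageMagick.
convert_to_png() {
    dir=${1:-./images}
    for img in "$dir"/*.ppm "$dir"/*.pbm; do
        [ -e "$img" ] || continue
        convert "$img" "${img%.*}.png"   # foo-000.ppm -> foo-000.png
    done
}
```

After sourcing the file, something like extract_all_images ~/pdfs ./images followed by convert_to_png ./images should leave one set of image files per PDF, each named after the document it came from. The original PPM/PBM files are kept, so delete them yourself once you are happy with the PNG versions.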

pdftotext -layout my_pdffile.pdf textfile.txt

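This command extracts the text from the PDF file, keeping the original physical layout as closely as possible (-layout), and puts it in textfile.txt. Without -layout, pdftotext writes the text in reading order. You can also send the text to stdout by replacing the target file name with ‘-’.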
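Sending the text to stdout is handy when you want to pipe it into other tools. As a small sketch, here is a helper that searches a PDF for a pattern. The name pdf_grep is invented for this example and is not a command shipped with poppler-utils.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of poppler-utils.
# Print the lines of a PDF's text that match a pattern, with line numbers.
pdf_grep() {
    pattern=$1
    pdf=$2
    # "-" as the output file makes pdftotext write to stdout
    pdftotext -layout "$pdf" - | grep -n -e "$pattern"
}
```

For example, pdf_grep invoice my_pdffile.pdf would list every line of the extracted text that contains the word invoice. The line numbers refer to the extracted text, not to page numbers in the PDF.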

Okular is a document viewer that can handle most portable document types, including PDF. Okular has a Selection tool that lets you select an area of the document and either copy it to the clipboard as an image or text (and then paste it into another application) or save it as a file.

2 Responses to “Extract images or text from PDF”
  1. Wolfgang says:

    Thanks for the explanation!

    Is it somehow possible to link the extracted images to their position in the text?

  2. smp says:

    I have some 20 PDF files and I want to extract all the images from each PDF and name them accordingly. How can I do this?
