Extract images or text from PDF
Sometimes you have a PDF file containing text and images that you want to use in another application. People often think a PDF is set in stone, but that is not true. It is actually quite easy to extract content from a PDF document. You have several options: there are command line tools available, and the document viewer Okular can also copy text and images.
poppler-utils
On Ubuntu this package is usually installed by default. If it is not installed on your system, run
sudo apt-get install poppler-utils
This package contains several command line tools, but let's focus on two of them: pdfimages extracts all images from a PDF file, and pdftotext converts the content of a PDF file to text. The command
pdfimages my_pdffile.pdf ./imagename
will extract all images from the document my_pdffile.pdf, save them in your current directory, and name them in sequence ("imagename-000.xxx", "imagename-001.xxx", etc.). Note that by default images are saved in PPM (non-monochrome) or PBM (monochrome) format. With the -j option, images stored in DCT format are saved as JPEG files instead. You can also batch convert all the images later with ImageMagick.
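The batch conversion mentioned above could look roughly like the sketch below. It assumes ImageMagick's convert command is installed (on ImageMagick 7 the command is called magick instead) and that the images were extracted with the prefix "imagename". The function name convert_extracted is made up for this example.

```shell
# Convert every PPM/PBM file written by pdfimages into a PNG file.
# Assumption: ImageMagick's "convert" is on the PATH.
convert_extracted() {
    prefix=$1
    for img in "$prefix"-*.ppm "$prefix"-*.pbm; do
        # An unmatched pattern is left as literal text; skip it.
        [ -e "$img" ] || continue
        # Strip the old extension and write a PNG next to the original.
        convert "$img" "${img%.*}.png"
    done
}

# Example: convert_extracted ./imagename
```

The original PPM/PBM files are left in place, so you can delete them once you have checked the results.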
pdftotext -layout my_pdffile.pdf textfile.txt
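This command extracts the text from the PDF file, preserving the original physical layout as closely as possible (-layout), and puts it in textfile.txt. Without -layout, pdftotext outputs the text in reading order. You can also send the text to stdout by replacing the target file name with '-'.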
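Writing to stdout is handy when you only want to search a document and do not need a text file. The following is a small sketch of that idea; the function name pdf_grep is invented for this example, and it assumes grep is available.

```shell
# Search the text of a PDF without creating an intermediate file.
# "-" as the output name makes pdftotext write to stdout.
pdf_grep() {
    pattern=$1
    pdf=$2
    # -n prints line numbers, -i ignores case.
    pdftotext -layout "$pdf" - | grep -n -i -- "$pattern"
}

# Example: pdf_grep "invoice" my_pdffile.pdf
```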
Okular
Okular is a document viewer that can handle most document formats, including PDF. Its Selection tool lets you select an area of the document and either copy it to the clipboard as an image or as text (and then paste it into another application) or save it to a file.
Thanks for the explanation!
Is it somehow possible to link the extracted images to their position in the text?
I have some 20 PDF files and I want to extract all the images from each PDF and name them accordingly. How can I do this?
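For many PDF files, one possible approach is a shell loop that uses each file's base name as the image prefix. This is only a sketch: the function name extract_all_images is made up here, and it assumes all the PDF files sit in a single directory.

```shell
# Extract the images of every PDF in a directory, using each file's
# base name as the image prefix (report.pdf -> report-000.ppm, ...).
extract_all_images() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # An unmatched pattern is left as literal text; skip it.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf" .pdf)
        # -j saves DCT (JPEG) images as .jpg files.
        pdfimages -j "$pdf" "$dir/$name"
    done
}

# Example: extract_all_images ~/Documents/pdfs
```

The images end up next to the PDF files they came from, so the prefix tells you which document each image belongs to.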