This is a script for finding all instances of a given search word (or multiple search words) in a PDF.
For our example, we’ll be using a PDF of Romeo and Juliet. In this case, our search terms are “Romeo” and “Juliet” (search is not case-sensitive).

    import PyPDF2
    import re
    # PdfFileReader, numPages, getPage and extractText are the pre-3.0 PyPDF2 names
    pdfFileObj = open(r'C:\Users\Craig\RomeoAndJuliet.pdf', mode='rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    number_of_pages = pdfReader.numPages

    pages_text = []
    words_start_pos = {}
    words = {}
    # Search terms are lowercase because the page text is lowercased before matching
    searchwords = ['romeo', 'juliet']

    with open('FoundWordsList.csv', 'w') as f:
        f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
        for word in searchwords:
            for page in range(number_of_pages):
                print(page)
                pages_text.append(pdfReader.getPage(page).extractText())
                # Start index of every match of the word on this page
                words_start_pos[page] = [dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
                # Slice the original (not lowercased) text so the output keeps its casing
                words[page] = [pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
            # Write this word's matches; page + 1 converts the zero-based index to a page number
            for page in words:
                for i in range(0, len(words[page])):
                    if str(words[page][i]) != 'nan':
                        f.write('{0},{1}\n'.format(page+1, words[page][i]))
                        print(page, words[page][i])

Running the script prints each instance of each search word with its PDF page index and writes the same matches, with their page numbers, to FoundWordsList.csv.
This script can be used for a variety of other applications by updating the file path (line 4, the `open` call) and the search terms (line 12, the `searchwords` list).
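If you'd rather not edit the script each time, the search step can be pulled into a function that takes the search terms as an argument. Here's a minimal sketch, assuming the page text has already been extracted into a list of strings (one per page). The names `find_words` and `sample_pages` are mine, for illustration only, and unlike the script it uses `re.escape`, so search words are matched as literal text rather than as regex patterns.

```python
import re

def find_words(pages_text, searchwords):
    """Return (page_number, matched_text) pairs; page numbers start at 1."""
    hits = []
    for word in searchwords:
        # re.escape makes the word literal; IGNORECASE replaces the .lower() step
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        for page_index, text in enumerate(pages_text):
            for match in pattern.finditer(text):
                hits.append((page_index + 1, match.group()))
    return hits

# Stand-in page text (hypothetical); in the script this comes from extractText()
sample_pages = ["Enter ROMEO and Juliet.", "Exit Romeo."]
hits = find_words(sample_pages, ["romeo", "juliet"])
for page_number, found in hits:
    print(page_number, found)
```

Because the function only sees strings, you can try new search terms on a couple of sample pages before pointing it at a full PDF.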
A few ideas for modification include:
- Frequency counts of words in books/lyrics (ATS has an awesome frequency count graph generator)
- Finding reference drawing numbers in a document
- Identifying search terms by prefixes rather than whole words
- Identifying sheets that need to be updated
- Using glob to iterate through multiple files
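A few of these ideas can be sketched with the standard library alone. This is a rough starting point rather than a finished tool: it assumes page text is already available as a list of strings, and the helper names (`count_words`, `find_by_prefix`, `list_pdfs`) and the `DWG-` drawing number format are hypothetical, made up for the example.

```python
import glob
import os
import re
from collections import Counter

def count_words(pages_text, searchwords):
    """Frequency count: total matches of each search word across all pages."""
    counts = Counter()
    for word in searchwords:
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        for text in pages_text:
            counts[word] += len(pattern.findall(text))
    return counts

def find_by_prefix(pages_text, prefix):
    """Prefix search: return (page_number, full_token) for tokens starting with prefix."""
    # \b anchors at the start of a word; \w* takes the rest of the token
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    return [
        (page_index + 1, match.group())
        for page_index, text in enumerate(pages_text)
        for match in pattern.finditer(text)
    ]

def list_pdfs(folder):
    """Multiple files: sorted paths of the .pdf files directly inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.pdf")))

# Stand-in page text (hypothetical)
pages = ["See drawing DWG-101 and DWG-202.", "Romeo, Romeo! Juliet sighs."]
print(count_words(pages, ["romeo", "juliet"]))
print(find_by_prefix(pages, "DWG-"))
```

To cover a whole folder, loop over `list_pdfs(folder)` and run the extraction and search from the main script on each path.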
How else would you modify this script? Let me know!
Thanks for reading!
Special thanks to these sources:
Automate the Boring Stuff with Python
ritesh_shrv on Stack Overflow