Capture Hard Drive Folder Structure with Python

Use glob and pandas to create a snapshot of any computer's current folder structure.

Suppose we want to capture the folder structure of a computer without backing up every single file. Maybe we want to index the folder structure, find all of the .py files scattered across the computer, or take a look at all of the files and folders that exist on another computer. Here we’ll capture the contents (all files and folders) of our input folder, along with the contents of every sub-folder.

 

To begin, we’ll define a few functions:

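import os

def FileOrFolder(filepath):
    #os.path.isdir checks the file system, so folder names that contain "."
    #and files without an extension are still labelled correctly
    if os.path.isdir(filepath):
        return('Folder')
    else:
        return('File')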

def StillFolders(dfcolumn):
    FolderCount=0
    for item in dfcolumn:
        if item=='Folder':
            FolderCount+=1
        else:
            pass 
    if FolderCount>0:
        return('Still Folders')
    else:
        return('No Folders')

We’ll want to continue looping through each sub-folder (and their sub-folders) until there are no more folders to look in. “FileOrFolder” identifies whether a given filepath is a File or Folder. “StillFolders” looks in a single column of a DataFrame and identifies whether or not any Folders are remaining.
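The folder check can also be written more compactly. The sketch below is an alternative, not the code used in this post: `has_folders` is a made-up name, and it returns a boolean rather than the 'Still Folders' / 'No Folders' strings. It accepts any iterable of labels, so a DataFrame column should work as well as a plain list.

```python
def has_folders(labels):
    """Return True if at least one label equals 'Folder'.

    `has_folders` is a hypothetical helper, not part of the original script.
    `labels` can be any iterable of strings, e.g. a list or a DataFrame column.
    """
    return any(label == 'Folder' for label in labels)


# A level that still contains a folder, and one that does not
print(has_folders(['File', 'Folder', 'File']))
print(has_folders(['File', 'File']))
```

With a boolean result, a loop condition could read `while has_folders(...)` instead of comparing against a string.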

def find_contents(folderpath):
    #Find contents of initial input (folderpath must end with a path separator)
    contents=pd.DataFrame(glob.glob(folderpath + '*'),columns=['Path'])
    #http://stackoverflow.com/questions/12356501/pandas-create-two-new-columns-in-a-dataframe-with-values-calculated-from-a-pre?rq=1
    contents['FileOrFolder']=contents['Path'].map(FileOrFolder)
    return contents

The “find_contents” function uses glob to find all of the contents of a given folderpath. The contents are returned as a DataFrame.
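Because the glob pattern is built as folderpath + '*', the result depends on whether folderpath ends with a path separator. The sketch below shows the difference on a throwaway directory. The helper name `list_with_pattern` and the folder and file names are made up for the demonstration.

```python
import glob
import os
import tempfile

def list_with_pattern(pattern):
    """Return the base names matched by a glob pattern, sorted."""
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: root/data/a.txt plus a sibling folder root/data_old
    data = os.path.join(root, 'data')
    os.makedirs(data)
    os.makedirs(os.path.join(root, 'data_old'))
    open(os.path.join(data, 'a.txt'), 'w').close()

    # No trailing separator: matches names that START with "data"
    without_sep = list_with_pattern(data + '*')
    # Trailing separator: matches what is INSIDE "data"
    with_sep = list_with_pattern(data + os.sep + '*')

print(without_sep)
print(with_sep)
```

So a path without a trailing separator, which is what a folder picker typically returns, needs one appended before it is passed to find_contents.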

To find all lower-level files and folders, we’ll write a short loop that keeps identifying the contents of sub-folders for as long as the previous level (its “Order”) still contains folders. The full code looks like this:

# -*- coding: utf-8 -*-
"""
Purpose: Returns all Folders and Files in a parent folder with hierarchical order.

Input: A folderpath.
Output: An Excel file with four columns
            A. Index - Integer.
            B. Path - String.
            C. FileOrFolder - String.
            D. Order - Integer.  "0" is the input folderpath.
"""
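import glob
import os
import pandas as pd
from tkinter import Tk
from tkinter import filedialog

Tk().withdraw()

def FileOrFolder(filepath):
    #os.path.isdir checks the file system, so folder names that contain "."
    #and files without an extension are still labelled correctly
    if os.path.isdir(filepath):
        return('Folder')
    else:
        return('File')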

def StillFolders(dfcolumn):
    FolderCount=0
    for item in dfcolumn:
        if item=='Folder':
            FolderCount+=1
        else:
            pass 
    if FolderCount>0:
        return('Still Folders')
    else:
        return('No Folders')

def find_contents(folderpath):
    #Find contents of initial input (folderpath must end with a path separator)
    contents=pd.DataFrame(glob.glob(folderpath + '*'),columns=['Path'])
    #http://stackoverflow.com/questions/12356501/pandas-create-two-new-columns-in-a-dataframe-with-values-calculated-from-a-pre?rq=1
    contents['FileOrFolder']=contents['Path'].map(FileOrFolder)
    return contents

folder=filedialog.askdirectory(initialdir=r'C:\\',title='Please select folder')

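all_levels={}
#askdirectory() normally returns a path without a trailing separator, so add one;
#otherwise the glob pattern matches siblings of the folder instead of its contents
all_levels[0]=find_contents(folder.rstrip('/') + '/')
all_levels[0]['Order']=0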

level=1

while StillFolders(all_levels[level-1]['FileOrFolder'])=='Still Folders':
    frames=[]
    #http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas
    for index, row in all_levels[level-1][all_levels[level-1]['FileOrFolder']=='Folder'].iterrows():
        frames.append(find_contents(row['Path'] + '\\'))
    #DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
    all_levels[level]=pd.concat(frames,ignore_index=True)
    all_levels[level]['Order']=level
    level+=1

#Concatenate all dataframes in all_levels
combined_all_levels=pd.concat([all_levels[level] for level in all_levels])
#Save to Excel on one sheet
combined_all_levels.to_excel('FolderHierarchyResults.xlsx',index_label='Index')

Our output is temporarily stored as a dictionary of DataFrames, which we then concatenate into a single DataFrame, and then finally use to_excel() to write our results into a spreadsheet.
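As a cross-check on the level-by-level loop, the same Path / FileOrFolder / Order rows can be produced with os.walk from the standard library. This is a swapped-in technique, not the glob and pandas method used above, and the names `walk_hierarchy` and `python_files` are made up. The second helper picks out the .py files mentioned at the start of the post.

```python
import os
import tempfile

def walk_hierarchy(root):
    """Return (path, 'File' or 'Folder', order) tuples for everything under root.

    `walk_hierarchy` is a hypothetical name. Order 0 means "directly inside
    root", as in the script, where the contents of the selected folder get 0.
    """
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Depth of dirpath below root: 0 for root itself, 1 for its children, ...
        order = 0 if rel == os.curdir else rel.count(os.sep) + 1
        rows += [(os.path.join(dirpath, d), 'Folder', order) for d in dirnames]
        rows += [(os.path.join(dirpath, f), 'File', order) for f in filenames]
    return rows

def python_files(rows):
    """Filter the rows down to the paths of files ending in .py."""
    return [path for path, kind, _ in rows if kind == 'File' and path.endswith('.py')]

# Demonstration on a throwaway tree: root/a.txt and root/sub/b.py
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    for name in ('a.txt', os.path.join('sub', 'b.py')):
        open(os.path.join(root, name), 'w').close()
    demo_rows = walk_hierarchy(root)
    demo_summary = sorted((os.path.relpath(p, root), kind, order)
                          for p, kind, order in demo_rows)
    demo_py = [os.path.relpath(p, root) for p in python_files(demo_rows)]

print(demo_summary)
print(demo_py)
```

The list of tuples could be handed to pd.DataFrame with columns Path, FileOrFolder and Order if a spreadsheet is still the goal.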

Finding Words with PyPDF2

Find all instances of words in a PDF with Python’s PyPDF2 library.

This is a script for finding all instances of a given search word (or multiple search words) in a PDF.

For our example, we’ll be using a PDF of Romeo and Juliet.  In this case, our search terms are “Romeo” and “Juliet” (search is not case-sensitive).

import PyPDF2
import re

pdfFileObj=open(r'C:\Users\Craig\RomeoAndJuliet.pdf',mode='rb')
pdfReader=PyPDF2.PdfReader(pdfFileObj)  #PdfFileReader, numPages, getPage and extractText were removed in PyPDF2 3.0
number_of_pages=len(pdfReader.pages)

pages_text=[]
words_start_pos={}
words={}

searchwords=['romeo','juliet']

#Extract the text of every page once, up front, rather than once per search word
for page in range(number_of_pages):
    pages_text.append(pdfReader.pages[page].extract_text())
pdfFileObj.close()

with open('FoundWordsList.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            #Search the lower-cased text so that the match is not case-sensitive
            words_start_pos[page]=[match.start() for match in re.finditer(re.escape(word), pages_text[page].lower())]
            #Slice the original text so that the output keeps its capitalization
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for found in words[page]:
                f.write('{0},{1}\n'.format(page+1, found))
                print(page+1, found)

We run the script and get an output that shows each instance of each search word and the associated PDF page number.

This script can be used for a variety of other applications by updating the file path (line 4) and the search terms (line 12).
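The matching step (lower-casing the page text, then slicing the original text at each match position) can be tried on a plain string without a PDF. This is a minimal sketch: `find_word` is a made-up name, the sample sentence is invented, and re.escape is an addition so that a search word containing regex characters is matched literally.

```python
import re

def find_word(text, word):
    """Return (start, matched_text) pairs for every case-insensitive hit.

    `find_word` is a hypothetical helper mirroring the script's approach:
    search the lower-cased text, then slice the original text so the
    result keeps its capitalization.
    """
    lowered = text.lower()
    return [(m.start(), text[m.start():m.start() + len(word)])
            for m in re.finditer(re.escape(word), lowered)]

# Invented sample text, standing in for one page of extracted PDF text
sample = "Romeo met Juliet. JULIET met romeo."
print(find_word(sample, "romeo"))
print(find_word(sample, "juliet"))
```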

A few ideas for modification include:

  • Frequency counts of words in books/lyrics (ATS has an awesome frequency count graph generator)
  • Finding reference drawing numbers in a document
  • Identifying search terms by prefixes rather than whole words
  • Identifying sheets that need to be updated
  • Using glob to iterate through multiple files
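Two of these ideas, frequency counts and prefix searches, might be sketched as follows. The function names and the sample text (including the drawing numbers) are invented for illustration, and the text is assumed to have already been extracted from the PDF.

```python
import re
from collections import Counter

def count_words(text):
    """Count how often each word occurs, ignoring case."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def find_by_prefix(text, prefix):
    """Return every term that starts with `prefix`, ignoring case."""
    # \b anchors the prefix at a word boundary; \w* takes the rest of the term
    pattern = r"\b" + re.escape(prefix) + r"\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)

# Invented sample text with made-up drawing numbers
sample = "Drawing DWG-101 references DWG-205 and dwg-7."
print(find_by_prefix(sample, "DWG-"))
print(count_words("the cat and the hat").most_common(1))
```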

How else would you modify this script?  Let me know!

Thanks for reading!

Special thanks to these sources:

Automate the Boring Stuff with Python
ritesh_shrv on Stack Overflow

Filenames to CSV

Python script for writing all filenames in the selected folder to ‘Filenames.csv’.


import glob
import os
from tkinter import Tk
from tkinter import filedialog

def FileList(folderpath):
    #Return the paths of everything directly inside folderpath
    return glob.glob(os.path.join(str(folderpath), '*'))

Tk().withdraw()
folderpath=filedialog.askdirectory()

with open('Filenames.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Filepath","Filename"))
    for file in FileList(folderpath):
        #os.path.basename returns the last part of the path, i.e. the filename
        f.write('{0},{1}\n'.format(file,os.path.basename(file)))
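One thing to watch for: a filename that contains a comma will spill into an extra column when rows are written with plain string formatting. A possible variation, sketched here with the standard-library csv module, quotes such values automatically. `write_filenames` and the sample filename are made up, and the folder is passed in as an argument rather than chosen through a dialog so that the sketch can run unattended.

```python
import csv
import glob
import os
import tempfile

def write_filenames(folderpath, csv_path):
    """Write one (Filepath, Filename) row per entry in folderpath; return the row count."""
    paths = sorted(glob.glob(os.path.join(folderpath, '*')))
    # newline='' lets the csv module control line endings itself
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Filepath', 'Filename'])
        for path in paths:
            writer.writerow([path, os.path.basename(path)])
    return len(paths)

# Demonstration on a throwaway folder holding one file with a comma in its name
with tempfile.TemporaryDirectory() as folder, tempfile.TemporaryDirectory() as out:
    open(os.path.join(folder, 'notes, final.txt'), 'w').close()
    csv_path = os.path.join(out, 'Filenames.csv')
    demo_count = write_filenames(folder, csv_path)
    with open(csv_path, newline='') as f:
        demo_rows = list(csv.reader(f))

print(demo_rows[1][1])
```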