Suppose we want to capture the folder structure of a computer without backing up every single file. Maybe we want to index that structure, find all of the .py files scattered across the machine, or look at all of the files and folders that exist on another computer. Here we’ll capture the contents (all files and folders) of an input folder, along with the contents of every sub-folder.
To begin, we’ll define a few functions:
    def FileOrFolder(filepath):
        if "." in filepath:
            return('File')
        else:
            return('Folder')

    def StillFolders(dfcolumn):
        FolderCount=0
        for item in dfcolumn:
            if item=='Folder':
                FolderCount+=1
            else:
                pass
        if FolderCount>0:
            return('Still Folders')
        else:
            return('No Folders')
We’ll want to keep looping through each sub-folder (and its sub-folders) until there are no more folders to look in. “FileOrFolder” labels a given filepath as a File or a Folder using a simple heuristic: any path containing a “.” is treated as a File. This misclassifies files without an extension, as well as any path in which a folder name contains a dot. “StillFolders” looks at a single column of a DataFrame and reports whether any Folders remain.
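If the dot heuristic turns out to be too coarse, one possible alternative is to ask the filesystem directly. The sketch below is not the code used in this post: the names file_or_folder and still_folders are hypothetical, and it assumes the paths exist on the machine running the script, since os.path.isdir returns False for paths it cannot find.

```python
import os

def file_or_folder(path):
    """Label a path by checking the filesystem instead of looking for a dot."""
    # os.path.isdir is True only for existing directories
    return 'Folder' if os.path.isdir(path) else 'File'

def still_folders(labels):
    """Return True if at least one label in the iterable is 'Folder'."""
    return any(label == 'Folder' for label in labels)
```

Returning a boolean from still_folders lets a loop read as "while still_folders(...)" rather than comparing against a string.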
    def find_contents(folderpath):
        #Find contents of initial input
        contents=pd.DataFrame(glob.glob(folderpath + '*'),columns=['Path'])
        #http://stackoverflow.com/questions/12356501/pandas-create-two-new-columns-in-a-dataframe-with-values-calculated-from-a-pre?rq=1
        contents['FileOrFolder']=contents['Path'].map(FileOrFolder)
        return contents
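The “find_contents” function uses glob to find the contents of a given folderpath and returns them as a DataFrame with one row per file or folder. Because the pattern is built as folderpath + '*', the folderpath needs to end with a path separator; otherwise glob matches names that merely start with the folder’s name rather than the folder’s contents.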
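To illustrate why the trailing separator matters, here is a small sketch using only the standard library. The helper name list_contents and the folder names docs and docs_old are hypothetical, made up for this example; it builds the pattern with os.path.join so that the separator is added for us.

```python
import glob
import os
import tempfile

def list_contents(folderpath):
    """Hypothetical helper: return the direct children of folderpath."""
    # os.path.join inserts the separator, so 'docs' and 'docs/' behave the same
    return sorted(glob.glob(os.path.join(folderpath, '*')))

def demo():
    """Compare a bare prefix pattern with a pattern built by os.path.join."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, 'docs')
        os.mkdir(target)
        os.mkdir(os.path.join(root, 'docs_old'))
        open(os.path.join(target, 'a.txt'), 'w').close()

        # Without a separator the pattern is a name prefix, not a folder listing
        prefix_matches = sorted(glob.glob(target + '*'))
        children = list_contents(target)
        return ([os.path.basename(p) for p in prefix_matches],
                [os.path.basename(p) for p in children])

if __name__ == '__main__':
    print(demo())  # (['docs', 'docs_old'], ['a.txt'])
```

Note that glob’s '*' pattern skips names that begin with a dot, so hidden files and folders will not appear in either listing.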
To find all lower-level files and folders, we’ll write a short loop that keeps listing the contents of sub-folders for as long as the previous level (its “Order”) still contains folders. The full code looks something like this:
    # -*- coding: utf-8 -*-
    """
    Purpose: Returns all Folders and Files in a parent folder with hierarchical order.
    Input: A folderpath.
    Output: An excel file with four columns
        A. Index - Integer.
        B. Path - String.
        C. FileOrFolder - String.
        D. Order - Integer. "0" is the direct contents of the input folderpath.
    """
    import glob
    import os
    import pandas as pd
    from tkinter import Tk
    from tkinter import filedialog

    Tk().withdraw()  #Hide the empty root window so only the dialog is shown

    def FileOrFolder(filepath):
        if "." in filepath:
            return('File')
        else:
            return('Folder')

    def StillFolders(dfcolumn):
        FolderCount=0
        for item in dfcolumn:
            if item=='Folder':
                FolderCount+=1
            else:
                pass
        if FolderCount>0:
            return('Still Folders')
        else:
            return('No Folders')

    def find_contents(folderpath):
        #Find contents of initial input
        contents=pd.DataFrame(glob.glob(folderpath + '*'),columns=['Path'])
        #http://stackoverflow.com/questions/12356501/pandas-create-two-new-columns-in-a-dataframe-with-values-calculated-from-a-pre?rq=1
        contents['FileOrFolder']=contents['Path'].map(FileOrFolder)
        return contents

    folder=filedialog.askdirectory(initialdir='C:\\',title='Please select folder')

    all_levels={}
    #os.path.join(folder, '') appends the trailing separator that find_contents expects
    all_levels[0]=find_contents(os.path.join(folder, ''))
    all_levels[0]['Order']=0

    level=1
    while StillFolders(all_levels[level-1]['FileOrFolder'])=='Still Folders':
        previous=all_levels[level-1]
        #http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas
        #DataFrame.append was removed in pandas 2.0, so collect each folder's contents and concatenate once
        pieces=[find_contents(os.path.join(row['Path'], ''))
                for index, row in previous[previous['FileOrFolder']=='Folder'].iterrows()]
        all_levels[level]=pd.concat(pieces,ignore_index=True)
        all_levels[level]['Order']=level
        level+=1

    #Concatenate all dataframes in all_levels
    combined_all_levels=pd.concat([all_levels[level] for level in all_levels])

    #Save to excel on one sheet
    combined_all_levels.to_excel('FolderHierarchyResults.xlsx',index_label='Index')
Our output is temporarily stored as a dictionary of DataFrames, one per level, which we concatenate into a single DataFrame and then write to a spreadsheet with to_excel() (this needs an Excel writer engine such as openpyxl to be installed).
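For comparison, the same Path / FileOrFolder / Order information can also be gathered with the standard library’s os.walk, which handles the recursion itself. This is a different technique from the level-by-level glob loop above, offered only as a rough sketch: the function name walk_with_order is hypothetical, and it assumes that Order 0 should mean the direct contents of the input folder, as in the script above.

```python
import os

def walk_with_order(root):
    """Return (path, kind, order) rows for everything under root.

    order 0 is the direct contents of root, order 1 the contents of
    its sub-folders, and so on.
    """
    rows = []
    root = os.path.abspath(root)
    for current, dirnames, filenames in os.walk(root):
        # Depth of the folder being listed, relative to root
        rel = os.path.relpath(current, root)
        order = 0 if rel == os.curdir else rel.count(os.sep) + 1
        for name in dirnames:
            rows.append((os.path.join(current, name), 'Folder', order))
        for name in filenames:
            rows.append((os.path.join(current, name), 'File', order))
    return rows
```

The resulting rows could then be handed to pd.DataFrame(rows, columns=['Path', 'FileOrFolder', 'Order']) and written out with to_excel() as before. Unlike the dot heuristic, this labels entries using what the filesystem reports, and it also includes hidden files that glob’s '*' pattern skips.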