Module thunderhopper.filetools
Functions
def assert_dir(dir_path, as_path=False)
def assert_dir(dir_path, as_path=False):
    """ Checks if folder exists and creates it, if necessary.

    Returns pathlib-standardized version of the given path as either a
    string or a pathlib.Path object.

    Parameters
    ----------
    dir_path : str or pathlib.Path object
        Relative or absolute path to folder. Creates full directory tree,
        including any missing parent folders.
    as_path : bool, optional
        If True, returns folder path as a pathlib.Path object.
        Else, returns a string. The default is False.
    """
    # Input interpretation:
    if not isinstance(dir_path, pathlib.Path):
        dir_path = pathlib.Path(dir_path)
    # Ensure folder existence:
    if not dir_path.exists():
        print(f'WARNING: Directory "{str(dir_path)}" does not exist. '
              'Added missing folders.')
        dir_path.mkdir(parents=True)
    return dir_path if as_path else str(dir_path)
Checks if folder exists and creates it, if necessary. Returns pathlib-standardized version of the given path as either a string or a pathlib.Path object.
Parameters

dir_path : str or pathlib.Path object
- Relative or absolute path to folder. Creates full directory tree, including any missing parent folders.
as_path : bool, optional
- If True, returns folder path as a pathlib.Path object. Else, returns a string. The default is False.
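A minimal usage sketch (the "results/plots" folder is a hypothetical example):

from thunderhopper.filetools import assert_dir

plot_dir = assert_dir('results/plots')                 # str, folders created if missing
plot_path = assert_dir('results/plots', as_path=True)  # pathlib.Path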
def check_extension(path, extension, n_suffixes=1)
def check_extension(path, extension, n_suffixes=1):
    """ Ensures that the given path has a correct file extension.

    Controls for presence of the specified target extension(s) in path.
    If path has no extension, appends the first target extension. If path
    has a different extension, replaces it accordingly. Can handle paths
    with multiple suffixes (more than a single "." in path).

    Parameters
    ----------
    path : str
        Absolute or relative path to check. Always treated as a file path.
    extension : str or tuple or list of str
        Target extension(s) to search for in path. Returns the given path
        if it contains any target extension. Else, uses the first provided
        target extension to complete the path or replace the current
        extension.
    n_suffixes : int, optional
        Number of path suffixes (segments starting with ".") to treat as
        the current file extension to be checked and replaced, if
        necessary. If 0, appends the target extension to the path
        regardless of any existing suffixes. If > 1, assumes a chained
        extension (e.g., ".tar.gz") that comprises the last n_suffixes
        segments. Warns about paths that contain more suffixes than
        expected. The default is 1.

    Returns
    -------
    path : str
        Checked path with the correct file extension. Returns either the
        given path or a modified version with the target extension appended
        or in place of the current extension.
    """
    # Assert iterable:
    if not isinstance(extension, (list, tuple)):
        extension = [extension]
    # Assert leading dot in given target extension(s):
    extension = [ex if ex.startswith('.') else '.' + ex for ex in extension]
    # Get path segments with a leading dot:
    suffixes = pathlib.Path(path).suffixes
    # Take all identified:
    if n_suffixes is None:
        n_suffixes = len(suffixes)
    # Report violated expectations:
    if len(suffixes) > n_suffixes:
        print('WARNING: Path contains more suffixes (".") than expected.')
    # Append (first) target extension:
    if not suffixes or n_suffixes == 0:
        return path + extension[0]
    # Keep or replace current extension:
    current_ext = ''.join(suffixes[-n_suffixes:])
    if current_ext in extension:
        return path
    return path[:-len(current_ext)] + extension[0]
Ensures that the given path has a correct file extension. Controls for presence of the specified target extension(s) in path. If path has no extension, appends the first target extension. If path has a different extension, replaces it accordingly. Can handle paths with multiple suffixes (more than a single "." in path).
Parameters

path : str
- Absolute or relative path to check. Always treated as a file path.
extension : str or tuple or list of str
- Target extension(s) to search for in path. Returns the given path if it contains any target extension. Else, uses the first provided target extension to complete the path or replace the current extension.
n_suffixes : int, optional
- Number of path suffixes (segments starting with ".") to treat as the current file extension to be checked and replaced, if necessary. If 0, appends the target extension to the path regardless of any existing suffixes. If > 1, assumes a chained extension (e.g., ".tar.gz") that comprises the last n_suffixes segments. Warns about paths that contain more suffixes than expected. The default is 1.

Returns

path : str
- Checked path with the correct file extension. Returns either the given path or a modified version with the target extension appended or in place of the current extension.
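A few illustrative calls (file names are hypothetical):

from thunderhopper.filetools import check_extension

check_extension('song', '.wav')                         # -> 'song.wav' (appended)
check_extension('song.mp3', '.wav')                     # -> 'song.wav' (replaced)
check_extension('song.wav', ('.wav', '.mp3'))           # -> 'song.wav' (kept)
check_extension('archive.tar.gz', 'zip', n_suffixes=2)  # -> 'archive.zip' (chained)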
def crop_paths(paths)
def crop_paths(paths):
    """ Crops parent folders from paths and removes any file extensions.

    Parameters
    ----------
    paths : str or list of str (m,)
        Absolute or relative paths to the desired target folders or files.

    Returns
    -------
    cropped : list of str (m,)
        Bare names of the folders and files in paths.
    """
    # Assert iterable:
    paths = check_list(paths)
    # Reduce paths to filenames:
    cropped = [pathlib.Path(path).stem for path in paths]
    return cropped
Crops parent folders from paths and removes any file extensions.
Parameters

paths : str or list of str (m,)
- Absolute or relative paths to the desired target folders or files.

Returns

cropped : list of str (m,)
- Bare names of the folders and files in paths.
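A quick sketch (paths are hypothetical). Note that pathlib's stem strips only the last suffix:

from thunderhopper.filetools import crop_paths

crop_paths(['../data/raw/Chorthippus_mollis_BM93.wav', 'notes.txt'])
# -> ['Chorthippus_mollis_BM93', 'notes']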
def search_files(keywords='*',
                 excl=[],
                 incl=[],
                 dir='../data/raw/',
                 ext='*',
                 subdirs=False,
                 resolve=True,
                 as_path=False)

def search_files(keywords='*', excl=[], incl=[], dir='../data/raw/',
                 ext='*', subdirs=False, resolve=True, as_path=False):
    """ Multi-keyword search among file paths in the given directory.

    Allows for post-search exclusion/inclusion criteria before returning.
    Uses pathlib's globbing, supporting path resolution and recursive
    subdirectory search.

    KEYWORD SYNTAX (current songdetector archive):
    Genus: Uppercase ('Chorthippus')
    Species: Lowercase ('_mollis')
    (Sub-species: Lowercase ('-ignifer'))
    Full name: 'Chorthippus_mollis-ignifer'
    Source: 'BM93', 'BM04', 'DJN', 'GBC', 'FTN'
    Temperature: '_T' or 'C_'
    Doubletag: 'DT' (potential duplicate segments within recording)
    Caution: 'CAUTION' (Only BM: '93/'04 might be same recording)

    Parameters
    ----------
    keywords : str or list of str (m,)
        Search keywords for repetitive globbing of file paths. Multiple
        keyword hits do not result in duplicates in the returned file list.
        If any keyword is '*', performs an all-out wildcard search for the
        given file extension(s) in the specified directory. Excluder and
        includer keywords are applied normally, if present.
        The default is '*'.
    excl : str or list of str (n,), optional
        Excluder keywords. Omits all globbed paths that contain any
        excluders. Excluders are applied before and take precedence over
        includers. The default is [].
    incl : str or list of str (p,), optional
        Includer keywords. Omits all globbed paths that lack any includers.
        Includers are applied after excluders. The default is [].
    dir : str or pathlib.Path object, optional
        Relative or absolute path to the wider directory in which to
        perform the keyword search. The default is '../data/raw/'.
    ext : str, optional
        File extension to narrow keyword search to the desired format (for
        example, 'wav'). Does not have to start with '.'.
        The default is '*', corresponding to any file type.
    subdirs : bool, optional
        If True, performs recursive search in all sub-directories of dir.
        Else, strictly searches the given folder only. The default is False.
    resolve : bool, optional
        If True, converts dir to an absolute path, resolving any symbolic
        links such as '..'. Raises an error if the given directory does not
        exist. Determines the format of the returned file paths
        (absolute/relative). The default is True.
    as_path : bool, optional
        If True, returns file paths as pathlib.Path objects.
        Else, returns strings. The default is False.

    Returns
    -------
    file_list : list of str (q,)
        Paths to files that match the specified keyword criteria in
        alphabetical order. Each path in the list is unique, regardless of
        the number of matching keywords.
    """
    # Input interpretation:
    if not ext.startswith('.'):
        ext = '.' + ext
    if not isinstance(dir, pathlib.Path):
        dir = pathlib.Path(dir)
    # Assert iterables:
    keywords, excl, incl = check_list(keywords, excl, incl)
    # Enable recursive folder search:
    wild_dir = '**/' if subdirs else ''
    if resolve:
        # Make absolute, check existence:
        dir = dir.resolve(strict=True)
    if '*' in keywords:
        # Pure wildcard search:
        keywords = ['*']
    else:
        # Wildcard buffer:
        ext = '*' + ext
        wild_dir += '*'
    # Search for files that contain any search keywords:
    files = [list(dir.glob(f'{wild_dir}{kw}{ext}')) for kw in keywords]
    # Join lists, remove duplicates, sort alphabetically, Path to string:
    file_list = [str(file) for file in sorted(list(set(sum(files, []))))]
    if excl:
        # Omit all files that contain any excluder keywords:
        file_list = [f for f in file_list if not any(ex in f for ex in excl)]
    if incl:
        # Omit all files that lack any includer keywords:
        file_list = [f for f in file_list if all(inc in f for inc in incl)]
    if as_path:
        # Convert strings back to Path objects:
        file_list = [pathlib.Path(f) for f in file_list]
    return file_list
Multi-keyword search among file paths in the given directory. Allows for post-search exclusion/inclusion criteria before returning. Uses pathlib's globbing, supporting path resolution and recursive subdirectory search.
KEYWORD SYNTAX (current songdetector archive):
Genus: Uppercase ('Chorthippus')
Species: Lowercase ('_mollis')
(Sub-species: Lowercase ('-ignifer'))
Full name: 'Chorthippus_mollis-ignifer'
Source: 'BM93', 'BM04', 'DJN', 'GBC', 'FTN'
Temperature: '_T' or 'C_'
Doubletag: 'DT' (potential duplicate segments within recording)
Caution: 'CAUTION' (Only BM: '93/'04 might be same recording)
Parameters

keywords : str or list of str (m,)
- Search keywords for repetitive globbing of file paths. Multiple keyword hits do not result in duplicates in the returned file list. If any keyword is '*', performs an all-out wildcard search for the given file extension(s) in the specified directory. Excluder and includer keywords are applied normally, if present. The default is '*'.
excl : str or list of str (n,), optional
- Excluder keywords. Omits all globbed paths that contain any excluders. Excluders are applied before and take precedence over includers. The default is [].
incl : str or list of str (p,), optional
- Includer keywords. Omits all globbed paths that lack any includers. Includers are applied after excluders. The default is [].
dir : str or pathlib.Path object, optional
- Relative or absolute path to the wider directory in which to perform the keyword search. The default is '../data/raw/'.
ext : str, optional
- File extension to narrow keyword search to the desired format (for example, 'wav'). Does not have to start with '.'. The default is '*', corresponding to any file type.
subdirs : bool, optional
- If True, performs recursive search in all sub-directories of dir. Else, strictly searches the given folder only. The default is False.
resolve : bool, optional
- If True, converts dir to an absolute path, resolving any symbolic links such as '..'. Raises an error if the given directory does not exist. Determines the format of the returned file paths (absolute/relative). The default is True.
as_path : bool, optional
- If True, returns file paths as pathlib.Path objects. Else, returns strings. The default is False.

Returns

file_list : list of str (q,)
- Paths to files that match the specified keyword criteria in alphabetical order. Each path in the list is unique, regardless of the number of matching keywords.
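A usage sketch (directory and keywords are hypothetical examples drawn from the keyword syntax above):

from thunderhopper.filetools import search_files

# All .wav files for two genera, excluding potential duplicates and flagged files:
files = search_files(keywords=['Chorthippus', 'Omocestus'], excl=['DT', 'CAUTION'],
                     dir='../data/raw/', ext='wav', subdirs=True)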
def to_archive(data)
def to_archive(data):
    """ Re-codes None entries as empty 1D arrays for writing to a npz archive.

    Avoids creation of arrays of data type object, which require pickling.

    Parameters
    ----------
    data : dict
        Data to be written to file using np.savez(). Other hard-to-serialize
        types like dictionaries, sets, or inhomogeneous lists are not treated.

    Returns
    -------
    data : dict
        Data in storable format.
    """
    return {k: np.array([]) if v is None else v for k, v in data.items()}
Re-codes None entries as empty 1D arrays for writing to a npz archive. Avoids creation of arrays of data type object, which require pickling.
Parameters

data : dict
- Data to be written to file using np.savez(). Other hard-to-serialize types like dictionaries, sets, or inhomogeneous lists are not treated.

Returns

data : dict
- Data in storable format.
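A minimal sketch of the intended write path (file name is hypothetical):

import numpy as np
from thunderhopper.filetools import to_archive

data = {'rate': 44100.0, 'onsets': np.arange(5), 'label': None}
np.savez('example.npz', **to_archive(data))   # 'label' is stored as an empty 1D array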
def from_archive(data)
def from_archive(data):
    """ Retrieves data from a npz archive, restoring much of the original form.

    Unpacks single ints, floats, bools, and strings from their 0D array
    containers. Converts empty 1D arrays back to Nones.

    Parameters
    ----------
    data : dict or NPZ archive object
        Data loaded from a npz archive using np.load(). Arrays of other or
        more complex data types like object are not treated.

    Returns
    -------
    data : dict
        Data in working format.
    """
    # Input interpretation:
    if not isinstance(data, dict):
        data = dict(data)
    # Convert arrays where necessary:
    for key, value in data.items():
        # Restore Nones from empty 1D arrays:
        if value.ndim == 1 and not value.size:
            data[key] = None
        # Restore singular entries from 0D arrays:
        elif value.ndim == 0 and value.size == 1:
            try:
                if np.isdtype(value.dtype, 'signed integer'):
                    data[key] = int(value)
                elif np.isdtype(value.dtype, 'real floating'):
                    data[key] = float(value)
                elif np.isdtype(value.dtype, 'bool'):
                    data[key] = bool(value)
                elif np.isdtype(value.dtype, np.str_):
                    data[key] = str(value)
            except AttributeError:
                # np.isdtype() was added in numpy version 2.0!
                # This might also work:
                data[key] = value.item()
    return data
Retrieves data from a npz archive, restoring much of the original form. Unpacks single ints, floats, bools, and strings from their 0D array containers. Converts empty 1D arrays back to Nones.
Parameters

data : dict or NPZ archive object
- Data loaded from a npz archive using np.load(). Arrays of other or more complex data types like object are not treated.

Returns

data : dict
- Data in working format.
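A round-trip sketch with to_archive() (file name is hypothetical):

import numpy as np
from thunderhopper.filetools import to_archive, from_archive

np.savez('example.npz', **to_archive({'rate': 44100.0, 'label': None}))
data = from_archive(np.load('example.npz'))
data['rate']   # -> 44100.0, unpacked from its 0D array container
data['label']  # -> None, restored from the empty 1D array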
def load_npz(path, files=[], keywords=[], prefix='', suffix='')
def load_npz(path, files=[], keywords=[], prefix='', suffix=''):
    #TODO: Document prefix and suffix keyword arguments.
    """ Pre-loads npz archive and loads the contained npy files into memory.

    Returns a dictionary of arrays. Output can be limited to files whose
    names are explicitly given or contain any of the specified keywords.
    The archive is closed again upon retrieving the requested data.

    Parameters
    ----------
    path : str or pathlib.Path
        Absolute or relative path to the npz archive.
    files : str or list or tuple of str (m,), optional
        Selection of names of npy files to load from the archive. Ignores
        file names that cannot be found in archive.files. The default is [].
    keywords : str or list or tuple of str (n,), optional
        Keywords to match against npy file names. The default is [].

    Returns
    -------
    data : dict of arrays (p,)
        Contents of the loaded npy files retrieved from the npz archive.
        If no files or keywords are specified, returns the entire archive.
        Else, only returns the subset of files that match the given criteria.
    """
    # Get zipped npy files:
    archive = np.load(path)
    # Load data into memory:
    if not files and not keywords:
        # Unselective early exit:
        data = dict(archive)
    else:
        # Ensure iterable:
        if isinstance(keywords, str):
            keywords = [keywords]
        # Select files to load by name or keywords:
        selected = lambda f: f in files or any(kw in f for kw in keywords)
        data = {f: archive[f] for f in filter(selected, archive.files)}
    # Safe return:
    archive.close()
    return unmerge_dicts(data, prefix, suffix) if prefix or suffix else data
Pre-loads npz archive and loads the contained npy files into memory. Returns a dictionary of arrays. Output can be limited to files whose names are explicitly given or contain any of the specified keywords. The archive is closed again upon retrieving the requested data.
Parameters

path : str or pathlib.Path
- Absolute or relative path to the npz archive.
files : str or list or tuple of str (m,), optional
- Selection of names of npy files to load from the archive. Ignores file names that cannot be found in archive.files. The default is [].
keywords : str or list or tuple of str (n,), optional
- Keywords to match against npy file names. The default is [].

Returns

data : dict of arrays (p,)
- Contents of the loaded npy files retrieved from the npz archive. If no files or keywords are specified, returns the entire archive. Else, only returns the subset of files that match the given criteria.
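A usage sketch (archive and file names are hypothetical):

from thunderhopper.filetools import load_npz

data = load_npz('session.npz')                       # entire archive as dict
spikes = load_npz('session.npz', files=['spikes'])   # single file by name
rates = load_npz('session.npz', keywords=['rate'])   # all files whose name contains 'rate'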
def expand_npz(path, **data)
def expand_npz(path, **data):
    """ Appends passed variables to an existing npz archive of npy files.

    Variables must be compatible with np.save(). Each variable is first
    saved as an in-memory buffer, so no extra files are created on disk
    apart from those added to the ZIP archive. Does not support file
    overwriting.

    Parameters
    ----------
    path : str or pathlib.Path
        Absolute or relative path to the existing npz archive.
    **data : dict
        Keyword arguments defining each variable and the name of the npy
        file to be created in the archive. Must be compatible with
        np.save(). May not contain any file names that are already present
        in the archive.
    """
    for name, variable in data.items():
        # Create in-memory npy file:
        file_buffer = io.BytesIO()
        np.save(file_buffer, variable)
        # Rewind to start:
        file_buffer.seek(0)
        # Write npy file to existing npz archive:
        with zipfile.ZipFile(path, mode='a') as zipf:
            zipf.writestr(name + '.npy', file_buffer.read())
    return None
Appends passed variables to an existing npz archive of npy files. Variables must be compatible with np.save(). Each variable is first saved as an in-memory buffer, so no extra files are created on disk apart from those added to the ZIP archive. Does not support file overwriting.
Parameters

path : str or pathlib.Path
- Absolute or relative path to the existing npz archive.
**data : dict
- Keyword arguments defining each variable and the name of the npy file to be created in the archive. Must be compatible with np.save(). May not contain any file names that are already present in the archive.
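A minimal sketch (archive and variable names are hypothetical):

import numpy as np
from thunderhopper.filetools import expand_npz

np.savez('session.npz', spikes=np.arange(10))
expand_npz('session.npz', rate=np.float64(44100.), labels=np.zeros(3))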
def trim_npz(path, files=[], keywords=[])
def trim_npz(path, files=[], keywords=[]):
    """ Removes specified npy files from an existing npz archive.

    Deleted files can be specified by name or by the presence of keywords.
    If any files are to be deleted, rewrites the archive without them.

    Parameters
    ----------
    path : str or pathlib.Path
        Absolute or relative path to the existing npz archive.
    files : str or list or tuple of str (m,), optional
        Selection of names of npy files to remove from the archive. Ignores
        file names that cannot be found in archive.files. The default is [].
    keywords : str or list or tuple of str (n,), optional
        Keywords to match against npy file names. The default is [].
    """
    # Ensure iterable:
    if isinstance(keywords, str):
        keywords = [keywords]
    if isinstance(files, str):
        files = [files]
    # Select files to keep by name or keywords:
    files = [f'{file}.npy' for file in files]
    selected = lambda f: f not in files and not any(kw in f for kw in keywords)
    # List existing npy files in npz archive:
    with zipfile.ZipFile(path, mode='r') as zipf:
        archive_files = zipf.namelist()
        keep = {f: zipf.read(f) for f in filter(selected, archive_files)}
    # Check if rewrite can be avoided:
    if len(keep) == len(archive_files):
        return None
    # Recreate archive without deleted files:
    with zipfile.ZipFile(path, 'w') as zipf:
        for name, file in keep.items():
            zipf.writestr(name, file)
    return None
Removes specified npy files from an existing npz archive. Deleted files can be specified by name or by the presence of keywords. If any files are to be deleted, rewrites the archive without them.
Parameters

path : str or pathlib.Path
- Absolute or relative path to the existing npz archive.
files : str or list or tuple of str (m,), optional
- Selection of names of npy files to remove from the archive. Ignores file names that cannot be found in archive.files. The default is [].
keywords : str or list or tuple of str (n,), optional
- Keywords to match against npy file names. The default is [].
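A usage sketch, continuing the hypothetical archive from expand_npz():

from thunderhopper.filetools import trim_npz

trim_npz('session.npz', files=['labels'])    # remove one file by name
trim_npz('session.npz', keywords=['rate'])   # remove all files whose name contains 'rate'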
def update_npz(path, **data)
def update_npz(path, **data):
    """ Inserts passed variables into an existing npz archive of npy files.

    Variables must be compatible with np.save(). If all variables are
    novel to the archive, calls expand_npz() to append them. If any file
    name is already present in the archive, rewrites the archive without
    the files to be overwritten, then calls expand_npz() to append the new
    variables.

    Parameters
    ----------
    path : str or pathlib.Path
        Absolute or relative path to the existing npz archive.
    **data : dict
        Keyword arguments defining each variable and the name of the npy
        file to be inserted in the archive. Must be compatible with
        np.save().
    """
    # List existing npy files in npz archive:
    with zipfile.ZipFile(path, mode='r') as zipf:
        # Get current archive contents:
        archive_files = zipf.namelist()
        # Identify archive files that will not be updated:
        untouched = [f for f in archive_files if f[:-4] not in data.keys()]
        # Check if file overwrite can be avoided:
        if len(untouched) == len(archive_files):
            # Pure append early exit:
            expand_npz(path, **data)
            return None
        # Gather contents of non-updated files:
        untouched = {f: zipf.read(f) for f in untouched}
    # Recreate archive without overwrites:
    with zipfile.ZipFile(path, 'w') as zipf:
        for name, file in untouched.items():
            zipf.writestr(name, file)
    # Append to new archive:
    expand_npz(path, **data)
    return None
Inserts passed variables into an existing npz archive of npy files. Variables must be compatible with np.save(). If all variables are novel to the archive, calls expand_npz() to append them. If any file name is already present in the archive, rewrites the archive without the files to be overwritten, then calls expand_npz() to append the new variables.
Parameters

path : str or pathlib.Path
- Absolute or relative path to the existing npz archive.
**data : dict
- Keyword arguments defining each variable and the name of the npy file to be inserted in the archive. Must be compatible with np.save().
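A usage sketch (archive and variable names are hypothetical):

import numpy as np
from thunderhopper.filetools import update_npz

# Overwrites the existing 'spikes' file, appends 'temperature' as a new one:
update_npz('session.npz', spikes=np.arange(20), temperature=np.float64(24.5))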
def write_json(data, path, make_dir=True)
def write_json(data, path, make_dir=True):
    """ Writes data collection in dictionary format to .json or .txt file.

    Parameters
    ----------
    data : dict
        Data to be saved. May contain different types of data. Data is
        rendered JSON-serializable before writing: Numpy int/float scalars
        to built-in types, Numpy ndarrays (and ndarrays in a list) to
        lists. Conversion may cause loss of precision!
    path : str
        Absolute or relative path to file in which data is saved.
    make_dir : bool, optional
        If True, creates missing parent folders in path. Else, raises a
        FileNotFoundError if directory does not exist. The default is True.
    """
    # JSON serialization:
    dictionary = data.copy()
    for key in dictionary.keys():
        value = dictionary[key]
        data_type = str(type(value))
        # Numpy integer types to built-in int:
        if 'int' in data_type and 'numpy' in data_type:
            dictionary.update({key: int(value)})
        # Numpy float types to built-in float:
        elif 'float' in data_type and 'numpy' in data_type:
            dictionary.update({key: float(value)})
        # Numpy ndarrays to list:
        elif 'ndarray' in data_type:
            dictionary.update({key: value.tolist()})
        # Numpy ndarrays in list to lists:
        elif 'list' in data_type:
            for i, val in enumerate(value):
                if 'ndarray' in str(type(val)):
                    value[i] = val.tolist()
            dictionary.update({key: value})
    if make_dir:
        # Optional folder creation:
        assert_dir(str(pathlib.Path(path).parent))
    # Write data to file:
    with open(path, 'w') as file:
        json.dump(dictionary, file, ensure_ascii=False, indent=4)
    return None
Writes data collection in dictionary format to .json or .txt file.
Parameters

data : dict
- Data to be saved. May contain different types of data. Data is rendered JSON-serializable before writing: Numpy int/float scalars to built-in types, Numpy ndarrays (and ndarrays in a list) to lists. Conversion may cause loss of precision!
path : str
- Absolute or relative path to file in which data is saved.
make_dir : bool, optional
- If True, creates missing parent folders in path. Else, raises a FileNotFoundError if directory does not exist. The default is True.
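A minimal sketch (path and contents are hypothetical):

import numpy as np
from thunderhopper.filetools import write_json

data = {'rate': np.float64(44100.), 'onsets': np.arange(3), 'species': 'Chorthippus_mollis'}
write_json(data, 'results/metadata.json')   # creates "results/" if missing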
def load_json(path, restore=None)
def load_json(path, restore=None):
    """ Loads data collection in dictionary format from .json or .txt file.

    Parameters
    ----------
    path : str
        Absolute or relative path to file in which data is stored.
    restore : str, optional
        If specified, converts stored lists back to Numpy ndarrays.
        If 'full', calls np.array() on entire list. If 'inner', leaves
        outer wrapper list as is and calls np.array() on all of its
        elements that are lists. Lists that contain any strings are always
        treated in the latter way. May have unexpected outcomes for higher
        levels of nesting! The default is None.

    Returns
    -------
    data : dict
        Dictionary containing loaded data.
    """
    # Read data from file:
    with open(path, 'r') as file:
        data = json.load(file)
    if restore in ['full', 'inner']:
        # Revert array JSON serialization:
        for key in data.keys():
            value = data[key]
            # Skip scalars and strings:
            if type(value) is not list:
                continue
            # Check data type of list elements:
            types = [type(val) for val in value]
            if restore == 'inner' or (str in types):
                # Inner lists to arrays:
                for i, val in enumerate(value):
                    if type(val) is list:
                        value[i] = np.array(val)
                data.update({key: value})
            else:
                # Entire list to array:
                data.update({key: np.array(value)})
    return data
Loads data collection in dictionary format from .json or .txt file.
Parameters

path : str
- Absolute or relative path to file in which data is stored.
restore : str, optional
- If specified, converts stored lists back to Numpy ndarrays. If 'full', calls np.array() on entire list. If 'inner', leaves outer wrapper list as is and calls np.array() on all of its elements that are lists. Lists that contain any strings are always treated in the latter way. May have unexpected outcomes for higher levels of nesting! The default is None.

Returns

data : dict
- Dictionary containing loaded data.
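A round-trip sketch with write_json() (path is hypothetical):

from thunderhopper.filetools import load_json

meta = load_json('results/metadata.json', restore='full')
meta['onsets']   # -> array([0, 1, 2]), restored from the stored list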
def merge_wavs(file_paths, save_path=None, return_out=False)
def merge_wavs(file_paths, save_path=None, return_out=False):
    """ Loads and concatenates audio recording data from multiple .wav files.

    Merged data can be written to a new .wav file and/or returned directly.
    Wrapper to audioio's load_audio() and write_audio() functions.

    Parameters
    ----------
    file_paths : list or tuple (m,) of str
        Collection of paths to several .wav files for merging. Recording
        data must have the same sampling rate and channel count. Arrays are
        stacked vertically along the time axis in order of the input list,
        so that data from the first file is on top.
    save_path : str, optional
        If specified, writes the merged recording data to a new .wav file
        at the given path. The default is None.
    return_out : bool, optional
        If True, returns the merged recording data as a numpy array
        together with the underlying sampling rate in Hz as float.
        Auto-enabled if save_path is not specified. The default is False.

    Returns
    -------
    merged : 2D array of floats (n, p)
        Merged recording data from all input files. Rows correspond to
        time, columns to individual channels. Only returned if return_out
        is True or save_path is None.
    rate : float
        Sampling rate of the merged recording data in Hz. Only returned if
        return_out is True or save_path is None.

    Raises
    ------
    ValueError
        Breaks if any input file has a different sampling rate than the
        others. Breaks without a dedicated error if channel count differs
        between files.
    """
    # Load recording data from audio files:
    signals, rates = zip(*[load_audio(str(path)) for path in file_paths])
    if len(set(rates)) > 1:
        # Validate sampling rate consistency:
        raise ValueError('All recordings must have the same sampling rate.')
    # Merge recording list:
    merged = np.vstack(signals)
    if save_path is not None:
        # Optional saving to new audio file:
        write_audio(save_path, merged, rates[0])
    if return_out or save_path is None:
        # Optional outputting:
        return merged, rates[0]
    return None
Loads and concatenates audio recording data from multiple .wav files. Merged data can be written to a new .wav file and/or returned directly. Wrapper to audioio's load_audio() and write_audio() functions.
Parameters

file_paths : list or tuple (m,) of str
- Collection of paths to several .wav files for merging. Recording data must have the same sampling rate and channel count. Arrays are stacked vertically along the time axis in order of the input list, so that data from the first file is on top.
save_path : str, optional
- If specified, writes the merged recording data to a new .wav file at the given path. The default is None.
return_out : bool, optional
- If True, returns the merged recording data as a numpy array together with the underlying sampling rate in Hz as float. Auto-enabled if save_path is not specified. The default is False.

Returns

merged : 2D array of floats (n, p)
- Merged recording data from all input files. Rows correspond to time, columns to individual channels. Only returned if return_out is True or save_path is None.
rate : float
- Sampling rate of the merged recording data in Hz. Only returned if return_out is True or save_path is None.

Raises

ValueError
- Breaks if any input file has a different sampling rate than the others. Breaks without a dedicated error if channel count differs between files.
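A usage sketch (file names are hypothetical; all files must share sampling rate and channel count):

from thunderhopper.filetools import merge_wavs

# Write the merged recording to file and also get the data back:
merged, rate = merge_wavs(['rec_001.wav', 'rec_002.wav'],
                          save_path='merged.wav', return_out=True)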
def merge_wav_series(rec_dur=20.0,
                     unit='seconds',
                     format='%Y%m%dT%H%M%S',
                     n_tag=15,
                     save_dir=None,
                     return_out=False,
                     return_paths=False,
                     **search_kwargs)
def merge_wav_series(rec_dur=20., unit='seconds', format='%Y%m%dT%H%M%S',
                     n_tag=15, save_dir=None, return_out=False,
                     return_paths=False, **search_kwargs):
    """ Automatic identification and merging of consecutive audio recordings.

    Uses search_files() to fetch all .wav files in the target directory
    that match the given search criteria, then extracts the time stamp from
    the end of each file name. Recordings whose start times are separated
    by an interval matching the specified duration are appended into a
    series and written to a new .wav file. Can return the merged recording
    data and/or a dictionary of paths to the new files and the merged ones.

    Parameters
    ----------
    rec_dur : float, optional
        Standard recording duration in the given unit. Must be the same
        for all files in a series for correct identification.
        The default is 20.0.
    unit : str, optional
        Unit of the recording duration. Must be known to datetime's
        timedelta class. The default is 'seconds'.
    format : str, optional
        Standard time stamp format used in all file names. Must be readable
        by datetime's strptime/strftime functions. Newly created files are
        tagged in the same format. May contain non-coding letters (no
        leading "%"), which are inherited by the time stamps of new files.
        The default is '%Y%m%dT%H%M%S', so 19980330T102700 for March 30th
        1998 at 10:27 am.
    n_tag : int, optional
        Number of characters occupied by the time stamp at the end of each
        file name. This is not the same as the length of the format string!
        The default is 15, matching the default format.
    save_dir : str or pathlib.Path object, optional
        Path to the folder where the merged recording data should be
        written to file. If None, saves into sub-folder "merged" in the
        target directory. The directory is created if missing in both
        cases. The default is None.
    return_out : bool, optional
        If True, returns a list of tuples containing the merged recording
        data and the corresponding sampling rate per series. If both return
        options are True, returns a tuple (data, paths).
        The default is False.
    return_paths : bool, optional
        If True, returns a dictionary of the paths to the new .wav files
        (keys) and lists of paths to the merged original files (values) for
        reference. If both return options are True, returns a tuple
        (data, paths). The default is False.
    **search_kwargs : dict
        Keyword arguments passed to search_files() for specifying the
        search criteria when fetching .wav files. Use 'dir' to set the
        target directory (default is current directory). Use 'keywords',
        'incl', and 'excl' to filter by file name (default is any .wav
        file). Forces 'ext' to 'wav' and 'as_path' to True.
        The default is {}.

    Returns
    -------
    merge_data : list (m,) of tuples (2,) of 2D array (n, p) and float
        For each identified recording series, the merged recording data
        from all matching .wav files as a numpy array and the corresponding
        sampling rate in Hz as float. Rows correspond to time, columns to
        individual channels. Only returned if return_out is True.
    merge_files : dict
        For each identified recording series, the path to the newly created
        .wav file in the given storage folder (keys) and a list of paths to
        the original .wav files that have been merged (values). All paths
        are returned as pathlib.Path objects. Only returned if return_paths
        is True.
    """
    # Enforce function-specific default search settings:
    search_kwargs.update({'ext': 'wav', 'as_path': True})
    # Optional default settings:
    if 'dir' not in search_kwargs:
        search_kwargs['dir'] = '.'
    if 'resolve' not in search_kwargs:
        search_kwargs['resolve'] = False
    # Fetch all matching .wav files:
    paths = search_files(**search_kwargs)
    if not paths:
        # Early exit for unsuccessful file search in target directory:
        print('WARNING: No .wav files found matching the search criteria.')
        return None
    # Extract time stamp from end of file names and convert to precise times:
    times = [datetime.strptime(path.stem[-n_tag:], format) for path in paths]
    delta = timedelta(**{unit: rec_dur})
    # Sort start times chronologically:
    sorted_inds = np.argsort(times)
    times, paths = np.array(times)[sorted_inds], np.array(paths)[sorted_inds]
    # Match start-to-start intervals against standard recording duration:
    break_inds = np.nonzero(np.append(np.diff(times) != delta, True))[0]
    # Identify and extract recording series:
    series_paths, series_times, start_ind = [], [], 0
    for end_ind in break_inds:
        if end_ind - start_ind:
            # Series must contain at least two consecutive recordings:
            series_paths.append(paths[start_ind : end_ind + 1].tolist())
            series_times.append(times[start_ind : end_ind + 1].tolist())
        start_ind = end_ind + 1
    if not series_paths:
        # Early exit for failure to identify any recording series:
        print('WARNING: No recording series found among the given .wav files.'
              '\nA series must contain at least two recordings with a time'
              f'stamp interval of {rec_dur} {unit}.')
        return None
    # Prepare storage:
    if save_dir is None:
        # Default to sub-folder "merged" in target directory:
        save_dir = pathlib.Path(search_kwargs['dir']) / 'merged'
    # Ensure folder existence (given or default):
    save_dir = assert_dir(save_dir, as_path=True)
    # Merge each recording series:
    merge_data, merge_files = [], {}
    for series, start_times in zip(series_paths, series_times):
        # Adapt series time stamp (full duration):
        t_start = start_times[0].strftime(format)
        t_end = (start_times[-1] + delta).strftime(format)
        # Assemble path to save new .wav file into storage folder:
        name = pathlib.Path(series[0]).stem[:-n_tag] + f'{t_start}-{t_end}.wav'
        name = str(save_dir / name)
        # Merge recording data and write array to file:
        out = merge_wavs(series, name, return_out=return_out)
        # Log signal and rate:
        merge_data.append(out)
        # Log new and merged files:
        merge_files[name] = series
    # Return options:
    if return_out and return_paths:
        return (merge_data, merge_files)
    elif return_out:
        return merge_data
    elif return_paths:
        return merge_files
    return None
Automatic identification and merging of consecutive audio recordings. Uses search_files() to fetch all .wav files in the target directory that match the given search criteria, then extracts the time stamp from the end of each file name. Recordings whose start times are separated by an interval matching the specified duration are appended into a series and written to a new .wav file. Can return the merged recording data and/or a dictionary of paths to the new files and the merged ones.
Parameters

rec_dur : float, optional
- Standard recording duration in the given unit. Must be the same for all files in a series for correct identification. The default is 20.0.
unit : str, optional
- Unit of the recording duration. Must be known to datetime's timedelta class. The default is 'seconds'.
format : str, optional
- Standard time stamp format used in all file names. Must be readable by datetime's strptime/strftime functions. Newly created files are tagged in the same format. May contain non-coding letters (no leading "%"), which are inherited by the time stamps of new files. The default is '%Y%m%dT%H%M%S', so 19980330T102700 for March 30th 1998 at 10:27 am.
n_tag : int, optional
- Number of characters occupied by the time stamp at the end of each file name. This is not the same as the length of the format string! The default is 15, matching the default format.
save_dir : str or pathlib.Path object, optional
- Path to the folder where the merged recording data should be written to file. If None, saves into sub-folder "merged" in the target directory. The directory is created if missing in both cases. The default is None.
return_out : bool, optional
- If True, returns a list of tuples containing the merged recording data and the corresponding sampling rate per series. If both return options are True, returns a tuple (data, paths). The default is False.
return_paths : bool, optional
- If True, returns a dictionary of the paths to the new .wav files (keys) and lists of paths to the merged original files (values) for reference. If both return options are True, returns a tuple (data, paths). The default is False.
**search_kwargs : dict
- Keyword arguments passed to search_files() for specifying the search criteria when fetching .wav files. Use 'dir' to set the target directory (default is current directory). Use 'keywords', 'incl', and 'excl' to filter by file name (default is any .wav file). Forces 'ext' to 'wav' and 'as_path' to True. The default is {}.

Returns

merge_data : list (m,) of tuples (2,) of 2D array (n, p) and float
- For each identified recording series, the merged recording data from all matching .wav files as a numpy array and the corresponding sampling rate in Hz as float. Rows correspond to time, columns to individual channels. Only returned if return_out is True.
merge_files : dict
- For each identified recording series, the path to the newly created .wav file in the given storage folder (keys) and a list of paths to the original .wav files that have been merged (values). All paths are returned as pathlib.Path objects. Only returned if return_paths is True.
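A usage sketch (directory is hypothetical; file names must end in the default time stamp format):

from thunderhopper.filetools import merge_wav_series

# Merge all series of consecutive 20-second recordings found in the folder:
merged_files = merge_wav_series(rec_dur=20., return_paths=True, dir='../data/raw/')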
def species_collection()
def species_collection():
    """ Shortcut to retrieve the current species scope of the songdetector.

    Add or remove species to modify model scope.

    Returns
    -------
    species_list : list of str (m,)
        Species names in alphabetical order.
    """
    species_list = ["Arcyptera_fusca", "Chorthippus_albomarginatus",
                    "Chorthippus_apricarius", "Chorthippus_biguttulus",
                    "Chorthippus_bornhalmi", "Chorthippus_brunneus",
                    "Chorthippus_dorsatus", "Chorthippus_mollis",
                    "Chorthippus_pullus", "Chorthippus_vagans",
                    "Chrysochraon_dispar", "Euthystira_brachyptera",
                    "Gomphocerippus_rufus", "Gomphocerus_sibiricus",
                    "Omocestus_haemorrhoidalis", "Omocestus_rufipes",
                    "Omocestus_viridulus", "Pseudochorthippus_montanus",
                    "Pseudochorthippus_parallelus", "Sphingonotus_caerulans",
                    "Stauroderus_scalaris", "Stenobothrus_lineatus",
                    "Stenobothrus_nigromaculatus", "Stenobothrus_rubicundulus",
                    "Stenobothrus_stigmaticus", "Stethophyma_grossum"]
    return species_list
Shortcut to retrieve the current species scope of the songdetector. Add or remove species to modify model scope.
Returns

species_list : list of str (m,)
- Species names in alphabetical order.
def genus_collection()
def genus_collection():
    """ Retrieves each unique genus in the current species_collection().

    Returns
    -------
    list of str (m,)
        Genus names in alphabetical order.
    """
    genus_list = [spec.split('_')[0] for spec in species_collection()]
    return list(np.unique(genus_list))
Retrieves each unique genus in the current species_collection().
Returns

list of str (m,)
- Genus names in alphabetical order.
def extract_species(paths,
                    omit_noise=True,
                    omit_subspec=True,
                    pretty_format=False,
                    short_genus=False,
                    tex_it=False)

def extract_species(paths, omit_noise=True, omit_subspec=True,
                    pretty_format=False, short_genus=False, tex_it=False):
    """ Extracts unique species names from a collection of file paths.

    Filenames must start with species names in the format "Genus_species_*"
    or "Genus_species-subspecies_*". Remainders of filenames are ignored.

    Parameters
    ----------
    paths : str or list of str (m,)
        Paths to files whose name contains a species name.
    omit_noise : bool, optional
        If True, skips any filename that contains the word "noise".
        The default is True.
    omit_subspec : bool, optional
        If True, removes sub-species designations from species names.
        The default is True.
    pretty_format : bool, optional
        If True, replaces original separators ('_', '-') with whitespace
        for pretty printing. This behavior also applies if short_genus or
        tex_it are True, regardless of pretty_format. The default is False.
    short_genus : bool, optional
        If True, abbreviates genus designations to their first letter and
        uses whitespace as separators. The default is False.
    tex_it : bool, optional
        If True, formats species names to Latex italic font and uses
        whitespace as separators. The default is False.

    Returns
    -------
    species_list : list of str (n,)
        Formatted unique species names in the given collection of file paths.
    """
    # Set up separators for different formatting styles:
    sep = [' ', ' '] if pretty_format or short_genus or tex_it else ['_', '-']
    # Extract filenames from paths:
    paths = crop_paths(paths)
    species_list = []
    for path in paths:
        # Skip over noise files:
        if omit_noise and 'noise' in path:
            continue
        # Split filename into segments:
        path_segments = path.split('_')
        if omit_subspec:
            # Remove sub-species designation:
            species_segment = path_segments[1].split('-')[0]
        else:
            # Keep sub-species with desired separator:
            species_segment = path_segments[1].replace('-', sep[1])
        if short_genus:
            # Abbreviate genus designation:
            name = path_segments[0][0] + '.' + sep[0] + species_segment
        else:
            # Keep full genus with desired separator:
            name = path_segments[0] + sep[0] + species_segment
        if name not in species_list:
            # Add novel species:
            species_list.append(name)
    if tex_it:
        # Format species names in italic LaTeX font:
        species_list = [f'\\textit{{{species}}}' for species in species_list]
    return species_list
Extracts unique species names from a collection of file paths. Filenames must start with species names in the format "Genus_species_*" or "Genus_species-subspecies_*". Remainders of filenames are ignored.
Parameters

paths : str or list of str (m,)
- Paths to files whose name contains a species name.
omit_noise : bool, optional
- If True, skips any filename that contains the word "noise". The default is True.
omit_subspec : bool, optional
- If True, removes sub-species designations from species names. The default is True.
pretty_format : bool, optional
- If True, replaces original separators ('_', '-') with whitespace for pretty printing. This behavior also applies if short_genus or tex_it are True, regardless of pretty_format. The default is False.
short_genus : bool, optional
- If True, abbreviates genus designations to their first letter and uses whitespace as separators. The default is False.
tex_it : bool, optional
- If True, formats species names to Latex italic font and uses whitespace as separators. The default is False.

Returns

species_list : list of str (n,)
- Formatted unique species names in the given collection of file paths.
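A few illustrative calls (file names are hypothetical):

from thunderhopper.filetools import extract_species

paths = ['Chorthippus_mollis-ignifer_BM93.wav', 'Chorthippus_biguttulus_DJN.wav']
extract_species(paths)                     # -> ['Chorthippus_mollis', 'Chorthippus_biguttulus']
extract_species(paths, short_genus=True)   # -> ['C. mollis', 'C. biguttulus']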
def count_files(paths, spec_list=None, as_array=False, **kwargs)
def count_files(paths, spec_list=None, as_array=False, **kwargs):
    """ Counts the occurrences of target species in a collection of file paths.

    Format of species names must be consistent between paths and spec_list.

    Parameters
    ----------
    paths : str or list of str (m,)
        Paths to files whose name contains a species name. The name of one
        species must follow the same format across paths to avoid miscounts.
    spec_list : str or list of str (n,), optional
        Species names to search for in paths. If None, uses
        extract_species() to retrieve all unique species names from paths.
        The default is None.
    as_array : bool, optional
        If True, returns file counts as Numpy array (in order of
        spec_list). Else, returns a dictionary with species names as keys
        for readability and saving. The default is False.
    **kwargs : dict, optional
        Keyword arguments passed to extract_species() if spec_list is None.
        Can be used to format dictionary keys or to treat sub-species as
        separate categories during counting.

    Returns
    -------
    file_count : dict or 1D array of ints (n,)
        Counts of files containing each target species in the specified format.
    """
    if spec_list is None:
        # Auto-generate target species:
        spec_list = extract_species(paths, **kwargs)
    # Assert iterable:
    spec_list = check_list(spec_list)
    # Count occurrences of each species:
    file_count = np.zeros(len(spec_list), dtype=int) if as_array else {}
    for i, species in enumerate(spec_list):
        species_count = np.sum([species in path for path in paths])
        file_count[i if as_array else species] = species_count
    return file_count
Counts the occurrences of target species in a collection of file paths. Format of species names must be consistent between paths and spec_list.
Parameters

paths : str or list of str (m,)
- Paths to files whose name contains a species name. The name of one species must follow the same format across paths to avoid miscounts.
spec_list : str or list of str (n,), optional
- Species names to search for in paths. If None, uses extract_species() to retrieve all unique species names from paths. The default is None.
as_array : bool, optional
- If True, returns file counts as Numpy array (in order of spec_list). Else, returns a dictionary with species names as keys for readability and saving. The default is False.
**kwargs : dict, optional
- Keyword arguments passed to extract_species() if spec_list is None. Can be used to format dictionary keys or to treat sub-species as separate categories during counting.

Returns

file_count : dict or 1D array of ints (n,)
- Counts of files containing each target species in the specified format.
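A quick sketch (file names are hypothetical):

from thunderhopper.filetools import count_files

paths = ['Chorthippus_mollis_BM93.wav', 'Chorthippus_mollis_DJN.wav',
         'Omocestus_viridulus_GBC.wav']
count_files(paths)   # -> {'Chorthippus_mollis': 2, 'Omocestus_viridulus': 1}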
def create_subsets(paths, amount_train, spec_list=None)
def create_subsets(paths, amount_train, spec_list=None):
    """ Splits a collection of file paths into training and test subsets.

    Randomly assigns the given amount of species-specific files to the
    training set and any remainders to the test set. Format of species
    names must be consistent between paths and spec_list.

    Parameters
    ----------
    paths : list of str (m,)
        Paths to files from which to create subsets. Each filename must
        contain a species name to identify species-specific files.
    amount_train : int or float
        If int, the absolute number of training files per species. If
        float, the proportion of training files per species (ensures at
        least one file per species). Training takes precedence over
        testing, so that there may be no files left for the test set if
        too many training files are requested.
    spec_list : str or list of str (n,), optional
        Species to consider in file collection when creating subsets. If
        None, uses extract_species() to retrieve all unique species names
        from paths. The default is None.

    Returns
    -------
    train_subset : list of str (p,)
        Randomly assigned species-specific training files.
    test_subset : list of str (q,)
        Remaining species-specific test files.
    """
    if spec_list is None:
        # Auto-generate target species:
        spec_list = extract_species(paths)
    train_subset = []
    test_subset = []
    # Assemble file subsets:
    for species in spec_list:
        # Get species-specific files:
        species_files = [file for file in paths if species in file]
        n_files = len(species_files)
        if isinstance(amount_train, int):
            # Absolute number of training files (not more than available):
            n_train = amount_train if amount_train < n_files else n_files
        elif isinstance(amount_train, float):
            # Proportion of training files (at least one):
            n_train = np.max([int(np.round(n_files*amount_train)), 1])
        # Shuffle and split:
        np.random.shuffle(species_files)
        train_subset += species_files[:n_train]
        test_subset += species_files[n_train:]
    return train_subset, test_subset
Splits a collection of file paths into training and test subsets. Randomly assigns the given amount of species-specific files to the training set and any remainders to the test set. Format of species names must be consistent between paths and spec_list.
Parameters

paths : list of str (m,)
- Paths to files from which to create subsets. Each filename must contain a species name to identify species-specific files.
amount_train : int or float
- If int, the absolute number of training files per species. If float, the proportion of training files per species (ensures at least one file per species). Training takes precedence over testing, so that there may be no files left for the test set if too many training files are requested.
spec_list : str or list of str (n,), optional
- Species to consider in file collection when creating subsets. If None, uses extract_species() to retrieve all unique species names from paths. The default is None.

Returns

train_subset : list of str (p,)
- Randomly assigned species-specific training files.
test_subset : list of str (q,)
- Remaining species-specific test files.
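A usage sketch (file names are hypothetical):

from thunderhopper.filetools import create_subsets

paths = ['Chorthippus_mollis_BM93.wav', 'Chorthippus_mollis_DJN.wav',
         'Omocestus_viridulus_GBC.wav', 'Omocestus_viridulus_FTN.wav']
train, test = create_subsets(paths, amount_train=1)    # one training file per species
train, test = create_subsets(paths, amount_train=0.5)  # half of each species' files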
def file_subsets(learn_files, nolearn_files, amount_train, verbose=None)
def file_subsets(learn_files, nolearn_files, amount_train, verbose=None):
    """ Wraps create_subsets() to split both learn and nolearn file collections.

    Learn files contain data that the model should learn to recognize.
    Nolearn files contain data that the model should learn to avoid.
    Both are split into training and test subsets.

    Parameters
    ----------
    learn_files : list of str (m,)
        Paths to files that contain desired learn data. Each filename must
        contain a species name to identify species-specific files.
    nolearn_files : list of str (n,)
        Paths to files that contain desired nolearn data. Each filename
        must contain a species name to identify species-specific files.
    amount_train : int or float
        If int, the absolute number of training files per species. If
        float, the proportion of training files per species (ensures at
        least one file per species). Training takes precedence over
        testing, so that there may be no files left for the test set if
        too many training files are requested.
    verbose : str, optional
        If 'short', prints summary of file and species counts for each
        subset. If 'full', prints additional metadata for each subset:
        Doubletag (file may contain duplicate segments), Caution (file may
        not be unique), Temp (file has some temperature information).
        The default is None.

    Returns
    -------
    learn_train : list of str (p,)
        Randomly assigned species-specific training files (learn data).
    nolearn_train : list of str (q,)
        Randomly assigned species-specific training files (nolearn data).
    test_files : list of str (r,)
        Remaining species-specific test files (both learn and nolearn data).
    """
    # Create separate subsets for learn and nolearn data:
    learn_train, learn_test = create_subsets(learn_files, amount_train)
    nolearn_train, nolearn_test = create_subsets(nolearn_files, amount_train)
    test_files = learn_test + nolearn_test
    # Optional feedback:
    if verbose in ['short', 'full']:
        # Count included files:
        n_all = len(learn_files) + len(nolearn_files)
        n_learn = len(learn_train)
        n_nolearn = len(nolearn_train)
        n_test = len(test_files)
        # Count included species:
        learn_species = len(extract_species(learn_train))
        nolearn_species = len(extract_species(nolearn_train))
        test_species = len(extract_species(test_files))
        all_species = learn_species + nolearn_species
        # Print short summary:
        print(f'\nfetched {n_all} files / {all_species} species'
              f'\nlearn: {n_learn} / {learn_species}'
              f'\nnolearn: {n_nolearn} / {nolearn_species}'
              f'\ntest: {n_test} / {test_species}\n')
    if verbose == 'full':
        # Gather additional metadata per subset:
        subsets = [learn_train, nolearn_train, test_files]
        strings = ['learn:', 'nolearn:', 'test:']
        print('files with special tags:')
        for subset, string in zip(subsets, strings):
            # Count and print occurrences of file tags:
            n_dt = len([f for f in subset if 'DT' in f])
            n_caution = len([f for f in subset if 'CAUTION' in f])
            n_temp = len([f for f in subset if '_T' in f])
            print(f'{string} doubletag: {n_dt}, caution: {n_caution}, '
                  f'temp: {n_temp}')
    return learn_train, nolearn_train, test_files
Wraps create_subsets() to split both learn and nolearn file collections. Learn files contain data that the model should learn to recognize. Nolearn files contain data that the model should learn to avoid. Both are split into training and test subsets.
Parameters

learn_files : list of str (m,)
- Paths to files that contain desired learn data. Each filename must contain a species name to identify species-specific files.
nolearn_files : list of str (n,)
- Paths to files that contain desired nolearn data. Each filename must contain a species name to identify species-specific files.
amount_train : int or float
- If int, the absolute number of training files per species. If float, the proportion of training files per species (ensures at least one file per species). Training takes precedence over testing, so that there may be no files left for the test set if too many training files are requested.
verbose : str, optional
- If 'short', prints summary of file and species counts for each subset. If 'full', prints additional metadata for each subset: Doubletag (file may contain duplicate segments), Caution (file may not be unique), Temp (file has some temperature information). The default is None.

Returns

learn_train : list of str (p,)
- Randomly assigned species-specific training files (learn data).
nolearn_train : list of str (q,)
- Randomly assigned species-specific training files (nolearn data).
test_files : list of str (r,)
- Remaining species-specific test files (both learn and nolearn data).
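A usage sketch (directory and species are hypothetical):

from thunderhopper.filetools import search_files, file_subsets

learn_files = search_files('Chorthippus_mollis', dir='../data/raw/', ext='wav')
nolearn_files = search_files('Omocestus_viridulus', dir='../data/raw/', ext='wav')
learn_train, nolearn_train, test_files = file_subsets(
    learn_files, nolearn_files, amount_train=5, verbose='short')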
def organize_files(target, amount_train, auto_complete=False, verbose=None, **kwargs)
def organize_files(target, amount_train, auto_complete=False, verbose=None,
                   **kwargs):
    """ Top-level file search and creation of training and test subsets.

    Finds target among pre-defined species scopes and fetches corresponding
    files to be split into subsets. Target species determines learn files
    (data that the model should learn to recognize), other species in the
    scope determine nolearn files (data that the model should learn to
    avoid). Currently available scopes are grasshoppers, frogs, frog
    courtship calls, and frog territorial calls.

    Parameters
    ----------
    target : str or list of str (m,)
        Target species that determine selection of learn files. Must be in
        the format "Genus_species" or "Genus_species_calltype". Returns
        with a warning if no known scope is found for the target. If
        'grasshoppers', 'frogs', 'frogs_courtship', or 'frogs_territorial',
        returns early with all files and species in this scope.
    amount_train : int or float
        Amount of training files for creation of training and test
        subsets. If int, the absolute number of training files per species.
        If float, the proportion of training files per species (ensures at
        least one file per species). Training takes precedence over
        testing, so that there may be no files left for the test set if
        too many training files are requested.
    auto_complete : bool, optional
        If True, automatically completes target species if it is a
        substring of grasshopper and frog species (no calltype
        distinction). Allows passing of targets in the format "species"
        without returning a warning. May result in multiple potentially
        conflicting target species. The default is False.
    verbose : str, optional
        If 'short', prints summary of file and species counts for each
        subset. If 'full', prints additional metadata for each subset:
        Doubletag (file may contain duplicate segments), Caution (file may
        not be unique), Temp (file has some temperature information).
        The default is None.
    **kwargs : dict, optional
        Keyword arguments passed to search_files() when fetching
        learn/nolearn files, or when returning with a shortcut. Can be used
        to specify a custom directory and file extension for the file search.

    Returns
    -------
    file_bundle : list (3,) of lists of str
        Paths to subset files in the order [learn, nolearn, test]. Some
        entries may be empty if no files are found in the given directory,
        or if too many training files are requested. If target is a
        shortcut string, returns a flat list of all files in the
        corresponding scope.
    species_bundle : list (3,) of lists of str
        Species names corresponding to file_bundle in the same order. If
        target is a shortcut string, returns a flat list of all species in
        the corresponding scope.
    """
    # Available species scopes:
    grasshoppers = species_collection()
    frogs = ['Rana_esculenta', 'Rana_lessonae', 'Rana_ridibunda']
    frogs_courtship = [frog + '_courtship' for frog in frogs]
    frogs_territorial = [frog + '_territorial' for frog in frogs]
    all_species = grasshoppers + frogs
    if auto_complete:
        # Auto-complete target species:
        target = [spec for spec in all_species if target in spec]
    # Determine desired scope and handle shortcuts:
    if any(np.isin(grasshoppers + ['grasshoppers'], target)):
        species = grasshoppers
        # Fetch all grasshopper files:
        if target == 'grasshoppers':
            return search_files(grasshoppers, **kwargs), species
    elif any(np.isin(frogs + ['frogs'], target)):
        species = frogs
        # Fetch all frog files:
        if target == 'frogs':
            return search_files(frogs, **kwargs), species
    elif any(np.isin(frogs_courtship + ['frogs_courtship'], target)):
        species = frogs_courtship
        # Fetch all files with frog courtship calls:
        if target == 'frogs_courtship':
            return search_files(frogs_courtship, **kwargs), species
    elif any(np.isin(frogs_territorial + ['frogs_territorial'], target)):
        species = frogs_territorial
        # Fetch all files with frog territorial calls:
        if target == 'frogs_territorial':
            return search_files(frogs_territorial, **kwargs), species
    else:
        print(f'WARNING: No known model scope for target {target}.')
        return None, None
    # Assert iterable:
    target = check_list(target)
    # Fetch learn files (target species):
    learn = search_files(target, **kwargs)
    # Fetch nolearn files (non-target species):
    others = [spec for spec in species if not (spec in target)]
    nolearn = search_files(others, **kwargs)
    # Assemble training and test subsets:
    file_bundle = list(file_subsets(learn, nolearn, amount_train, verbose))
    species_bundle = [target, others, extract_species(file_bundle[2])]
    return file_bundle, species_bundle
Top-level file search and creation of training and test subsets. Finds target among pre-defined species scopes and fetches corresponding files to be split into subsets. Target species determines learn files (data that the model should learn to recognize), other species in the scope determine nolearn files (data that the model should learn to avoid). Currently available scopes are grasshoppers, frogs, frog courtship calls, and frog territorial calls.
Parameters

target : str or list of str (m,)
- Target species that determine selection of learn files. Must be in the format "Genus_species" or "Genus_species_calltype". Returns with a warning if no known scope is found for the target. If 'grasshoppers', 'frogs', 'frogs_courtship', or 'frogs_territorial', returns early with all files and species in this scope.
amount_train : int or float
- Amount of training files for creation of training and test subsets. If int, the absolute number of training files per species. If float, the proportion of training files per species (ensures at least one file per species). Training takes precedence over testing, so that there may be no files left for the test set if too many training files are requested.
auto_complete : bool, optional
- If True, automatically completes target species if it is a substring of grasshopper and frog species (no calltype distinction). Allows passing of targets in the format "species" without returning a warning. May result in multiple potentially conflicting target species. The default is False.
verbose : str, optional
- If 'short', prints summary of file and species counts for each subset. If 'full', prints additional metadata for each subset: Doubletag (file may contain duplicate segments), Caution (file may not be unique), Temp (file has some temperature information). The default is None.
**kwargs : dict, optional
- Keyword arguments passed to search_files() when fetching learn/nolearn files, or when returning with a shortcut. Can be used to specify a custom directory and file extension for the file search.

Returns

file_bundle : list (3,) of lists of str
- Paths to subset files in the order [learn, nolearn, test]. Some entries may be empty if no files are found in the given directory, or if too many training files are requested. If target is a shortcut string, returns a flat list of all files in the corresponding scope.
species_bundle : list (3,) of lists of str
- Species names corresponding to file_bundle in the same order. If target is a shortcut string, returns a flat list of all species in the corresponding scope.
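A usage sketch (directory is hypothetical; the target must belong to a known scope):

from thunderhopper.filetools import organize_files

files, species = organize_files('Chorthippus_biguttulus', amount_train=0.8,
                                verbose='short', dir='../data/raw/')
learn_train, nolearn_train, test_files = files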