“What is a Woodchuck animal?”, you might ask. Or perhaps, “Why would you start off with a rhetorical question like a grade-schooler?” you might also ask. The two are surprisingly related. Or not at all. It is one of the two. It matters not, though, for the answer to the first is that a ‘Woodchuck’ animal is any animal whose name can be recited in place of woodchuck in the well-known tongue twister involving said creature, “How much wood could a woodchuck chuck,” and so on and so forth. The format is as follows:
How much [noun] could a [noun-verb] [verb] if a [noun-verb] could [verb] [noun].
Which, despite being the template, seems to form a perfectly pleasant tongue-twister of its own.
To be more specific, the kind of noun required is an uncountable noun, necessitated by “much”. We can, however, include countable nouns and replace “much” with “many”, while also pluralizing the noun when it sits on its own. We will see later that this addition greatly expands the scope of Woodchuck animals (and it may also happen to make them easier to search for).
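To make the format concrete, here is a minimal sketch of the template as a Python function (the function and its names are my own invention, not part of the project code):

```python
# A minimal sketch of the template above. The compound animal name is taken
# to be the noun and verb run together, as in wood + chuck -> woodchuck.
def twister(noun, verb, countable=False):
    quantifier = "many" if countable else "much"
    thing = noun + "s" if countable else noun
    return ("How %s %s could a %s%s %s if a %s%s could %s %s?"
            % (quantifier, thing, noun, verb, verb, noun, verb, verb, thing))

print(twister("wood", "chuck"))
# How much wood could a woodchuck chuck if a woodchuck could chuck wood?
```

The `countable` flag handles the “many”/pluralized variant described above.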
My first encounter with a non-woodchuck ‘woodchuck’ animal was during a road trip. I remember myself as a younger boy, staring out the window as placid rows of long grass and water rolled by, when it suddenly hit me. Sandpiper. Indeed, you might say, what a marvelous thought, but what of it? Well, do not worry, I would say, I will tell you promptly what of it. For, how much sand could a sandpiper pipe if a sandpiper could pipe sand? For me that day, the glorious sandpiper gave the woodchuck a great shove off its high horse and back into the burrow from whence it came.
From this followed years of me silently reciting my very own sandpiper version to myself. More recently, however, I had an additional thought. As demolisher of the woodchuck’s reign, it should indeed be my duty to liberate other animals from their non-tongue-twister confines. Hence, I turned to a member of the Serpentes suborder (or the surname of a particular flying circus owner): Python.
The task seemed relatively straightforward: use files of nouns and verbs to search a file of animal names in order to compile a list of animals of the ‘woodchuck’ variety. So I sketched out some pseudo-code and started piecing together what I could.
I had learned the basics of Python a while back, but I had never really used it to do anything, so I had some digging through documentation to do. I had to do a review of Python’s file I/O. I additionally wanted to make the program as clean as my capacity allowed. One consequence of this was brushing up on error handling, for as a simple lesser-classman student I (unfortunately) never really have to think about such things for my assignments, so I did a bit of learning on that. Additionally, I wished to make it more usable by giving it the ability to take argument values passed in from the terminal. I could have done this using sys.argv, but I found a Python module, argparse, which makes it a little bit nicer.
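As a quick sketch of the difference (a toy example, not the final script): sys.argv is just a raw list of strings, while argparse gives each argument a name, help text, and error messages for free.

```python
import argparse

# With sys.argv you would index raw strings yourself, e.g. sys.argv[1].
# argparse instead names each argument and generates --help automatically.
parser = argparse.ArgumentParser(prog='example.py')
parser.add_argument("animal_file", help="file of animal names")

# Passing a list here stands in for real command-line arguments;
# a plain parse_args() would read sys.argv instead.
args = parser.parse_args(["animals.txt"])
print(args.animal_file)  # animals.txt
```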
The code itself took on several iterations. First, I got the search working with a small dataset. But once I had finished a parser to extract nouns to one file and verbs to another from a large dictionary, the search through those files took much too long. I ended up reordering the loops and adding a conditional, so that it only checked every noun-verb pair if at least the noun was found in the animal name. This was a large improvement over the strict n³ running time. It also let me stop the creation of duplicates: once the search found a single noun-verb instance for an animal name, it would exit the two inner loops and continue with the next animal, rather than waste time searching for more matching instances.
```python
import argparse
import sys
import time

parser = argparse.ArgumentParser(prog='woodchuckfinder.py')
parser.add_argument("animal_file", help="file of animal names")
parser.add_argument("noun_file", help="file of nouns")
parser.add_argument("verb_file", help="file of verbs")
parser.add_argument("output_file", help="file to output 'woodchuck' names to")
args = parser.parse_args()

# progress bar display function courtesy of vladignatyev
def progress(count, total, status=''):
    bar_len = 60
    filled_len = int(round(bar_len * count / float(total)))
    percents = round(100.0 * count / float(total), 1)
    bar = '=' * filled_len + '-' * (bar_len - filled_len)
    sys.stdout.write('[%s] %s%s ...%s\r' % (bar, percents, '%', status))
    sys.stdout.flush()

try:
    # get animal_file's line count
    total = 0
    with open(args.animal_file) as animal_file:
        for animal in animal_file:
            total += 1

    start = time.time()
    with open(args.animal_file) as animal_file, \
         open(args.noun_file) as noun_file, \
         open(args.verb_file) as verb_file, \
         open(args.output_file, "w+") as output:
        count = 0
        for animal in animal_file:
            a = animal.strip().lower()
            found = False
            for noun in noun_file:
                n = noun.strip().lower()
                if n in a:  # only search noun+verb combinations if at least the noun is found
                    for verb in verb_file:
                        v = verb.strip().lower()
                        if (n + v) in a or (n + ' ' + v) in a:
                            output.write(animal.strip() + '\n')
                            found = True
                            break
                    verb_file.seek(0)
                if found:  # break before the animal is found again and duplicates are added
                    break
            noun_file.seek(0)
            count += 1
            progress(count, total, 'search progress')
    end = time.time()
    print("Total elapsed time: %s seconds" % (end - start))
except OSError:
    sys.exit("file not found")
```
Probably the main thing I learned from this, though, was some of the problem solving involved in parsing through a dataset to get what you want out of it, such as I faced when creating my compilation of animal names from a taxonomic database:
```python
import csv
import argparse
import sys
import queue
import pathlib

parser = argparse.ArgumentParser()
parser.add_argument("animals", help="a directory or file of ITIS animal taxonomy csv files")
parser.add_argument("output", help="output file to append to")
args = parser.parse_args()

# extracts only the vernacular names from an ITIS csv file
def parse_and_append(input_path, output_file):
    try:
        with input_path.open(newline='') as file, open(output_file, "a") as output:
            reader = csv.reader(file, delimiter='|')
            for str_list in reader:
                # check if it is an English vernacular name
                if ("[VR]" in str_list) and ("English" in str_list):
                    output.write(str(str_list) + '\n')
    except OSError:
        sys.exit("file not found")

q = queue.Queue()
path = pathlib.Path(args.animals)
q.put(path)

# empty the file for appending
with open(args.output, 'w'):
    pass

while not q.empty():
    curr = q.get_nowait()
    # if dir, enqueue its children to be explored
    if curr.is_dir():
        for child in curr.iterdir():
            q.put(child)
    # if file, append vernacular names to the output
    elif curr.is_file():
        parse_and_append(curr, args.output)
```
This code traversed a directory of taxonomic data files and appended the vernacular names of animals from each file to the output. Just finding a dataset I could use took me quite a while. There were plenty of dopey lists of animal names which I could have scraped from sites, but I wanted a beefy list, something nearing the be-all and end-all. I found several compendiums, but none were quite my solution.
I finally came across ITIS, the Integrated Taxonomic Information System, a US government venture to “create an easily accessible database with reliable information on species names and their hierarchical classification”. This ended up being nearly perfect; the data was accessible and easy enough to parse. It even gives the option of what data to include with each taxon, so I could select to download only the vernacular data, which I found out is the word taxonomists use for the common names of creatures. The data files also include a bunch of gobbledygook before and after, but this can be ignored until we get to the ‘vernacular’ entries we are after.
A problem I came across, though, is that while I could narrow down the selection with decent fidelity, the ITIS site restricted the size of the dataset it would provide. So, while I could select all of Vertebrata, it was much too large to be provided immediately, and even classes like Aves and Mammalia were too large. The site said it required extra time to process such requests and asked for an email, but nothing ever seemed to come of it. My solution was to find the largest sets which could be immediately downloaded so as to form a partition of the whole. This meant going through and selecting taxa to request from the site, for a total of about seventy data files. This was a bit tedious, but not beyond my patience. What was a bit more daunting, though, was the thought of running a program on each file, one at a time. So, I was back to some research and documentation exploring for a solution. Through this, I found Python’s
pathlib module, and learned how to traverse directories to construct my list of animal names in one go (for fun I organized the files into directories by taxa and taxonomic rank, so it was necessary to go several directories deep, as is evident in the code).
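In hindsight, the explicit queue could likely be replaced with pathlib’s built-in recursive globbing; here is a minimal sketch of that alternative (the helper name is my own invention):

```python
import pathlib

# A sketch of the same traversal using rglob('*'), which yields every
# path under the directory tree, instead of an explicit queue.
def iter_files(root):
    path = pathlib.Path(root)
    if path.is_file():
        yield path
    else:
        for child in path.rglob('*'):
            if child.is_file():
                yield child
```

Each yielded file could then be handed to the same parse-and-append step as before.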
The important fact then comes to light: whether I found any ‘woodchuck’ animals or not. Sort of. I mean, I certainly did, but there also happened to be loads of malarkey. My intention was to compile as large a list of these coincidentally named creatures as possible, so I got together a large list of animals. This was well and good; the more to search through, the better, as long as you don’t mind waiting a few (about twenty) minutes. The problem arises when you search with an equally large list of words; at that point, you are searching the ocean with a net, the rope of which could be used to lasso the moon.
This, at least, is what I believe my problem currently is. I am using lists of thirty-six thousand animals, twenty thousand nouns, and four thousand verbs. Not enormous by data science standards, but for what they cover, they’re quite large. Thus, my nouns and verbs contain all manner of silly words which are used only in very specific contexts. One solution would be manually sifting through the output to find and delete the nouns or verbs which gave rise to less-than-obvious ‘woodchuck’ names. This tempted me for a bit, but it would end up being much too much work. Part of this exercise is getting the program to do all the work, and that would spoil it a bit.
Another option to reduce the set I am searching with is finding a “1000 Most Common Nouns” list or some such thing. This, though, cuts down the dataset more than I would like, because I am still looking to compile as large a list as I can. Most preferable would be to find a much larger ranking of word usage with which to sort the words, and then manually set the cut-off myself where the words start becoming too obscure for what I am looking for.
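That cut-off idea could be sketched like this, assuming a hypothetical frequency-ranked word file (one word per line, most common first); the file name and the cutoff value are placeholders of mine, not something the project actually has:

```python
# A sketch of the manual cut-off: keep only the top 'cutoff' words from a
# frequency-sorted word list, dropping the obscure tail.
def load_top_words(path, cutoff):
    words = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= cutoff:  # everything past here is deemed too obscure
                break
            words.append(line.strip().lower())
    return words

# nouns = load_top_words("ranked_nouns.txt", 5000)  # hypothetical file and cutoff
```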
It is not a complete loss. There were certainly plenty of woodchuck fish, as noun-fish is quite common for a name. There can be some refining done, certainly, and in the future I may try to implement such refining, but for now I have a long list of cool animal names, some of which reasonably fit the bill for a woodchuck animal. The full list can be found here.
Finally though, I would like to highlight some of the special little animals I found along my journey.
The Hyrax, which basically has an order of its own, and is more closely related to elephants and manatees than to rodents (according to Wikipedia at the time I am writing this):
the pretty-bird Tauraco:
and the big-eyed Trogon:
Then there are lampreys, which will forever be the haunters of my nightmares, but I will not be including a picture of them, mostly for my own sake.
Credits go to Manas Sharma, who provided the English dictionary in CSV format found at http://www.bragitoff.com/2016/03/english-dictionary-in-csv-format/, and to vladignatyev, who provided the Python snippet for displaying a progress bar in the terminal at https://gist.github.com/vladignatyev/06860ec2040cb497f0f3, which I used in the woodchuck finder.