A frequent programming task is to read in data from a file and process it. Python has functions and methods for doing this in a few lines of code.
open(filename): Takes the name of a file and produces a readable file value.
file.read(): Returns the content of the file as a string.
This means that, for example, if you have a file called words5.txt in your workspace, you can use the definition below, and a string of the text in the file will be stored in the contents variable:
contents = open("words5.txt").read()
Note that if you want to open a file named something other than words5.txt, you would need to replace the filename in the use of open(). We provide three files you can use for this assignment, words.txt, words100.txt, and words5.txt. You should be able to see these files when you open the assignment for the first time. words.txt is a file that contains some standard English words for spell checking and autocorrect (Ed cannot display a file this large, but feel free to download the file and see what’s in it). words100.txt is a random sample of those words, and words5.txt is a sample of just 5 words. These smaller samples are useful for testing your code.
This assignment will have you read in a file that has text on many different lines. You can use the string split method to get a list of the lines in the file. The way text is stored in files, the special character \n indicates a new line (n is for newline) – you see it as the text going onto the next line, but Python sees it as this special character. So you can split the text from the file on newlines and get a list of strings with the code below (a more detailed example is shown at the end of this document).
contents = open("words5.txt").read()
lines = contents.strip().split("\n")  # strip() removes extra newlines from the start/end
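To see what strip() and split("\n") actually do to the text of a file, here is a minimal sketch. It writes a small temporary file first so that it runs anywhere; the file name demo_words.txt and its three words are made up for illustration and are not one of the provided files.

```python
import os
import tempfile

# Hypothetical sample data: three words, one per line, with a trailing newline.
sample_text = "apple\nbanana\ncherry\n"

# Write the sample to a temporary file so the example does not depend on the workspace.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_words.txt")
with open(demo_path, "w") as demo_file:
    demo_file.write(sample_text)

# Read the whole file into one string (the with block also closes the file afterwards).
with open(demo_path) as demo_file:
    contents = demo_file.read()

lines = contents.strip().split("\n")
print(repr(contents))  # the raw string, with the \n characters visible
print(lines)           # one list item per line of the file

# Without strip(), the trailing newline leaves an extra empty string at the end.
lines_without_strip = contents.split("\n")
print(lines_without_strip)
```

The last print shows why the strip() call is worth keeping: a file that ends with a newline would otherwise give you an empty string as its final "line".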
Implementation
You will choose an interesting textual dataset: you can use the provided words files or find one on your own. Then, in explore.py, you’ll write an interesting search on the text (using filter) and an interesting transformation of the lines of text (using map). You must do both.
You can pick any search criteria you want. For example, you might choose one of the following for filter:
Find all the lines that have at least three “a” characters
Find all the lines that contain a particular string
Find all the lines that start with a particular string
Find all the lines that contain numeric data
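As a rough illustration of the first idea in the list above, the sketch below passes a small function to filter. The name has_three_a and the sample words are assumptions chosen for the example; your own criterion and data will differ.

```python
# Hypothetical sample lines; in the assignment these would come from a file.
sample_lines = ["banana", "alpaca", "cherry", "abracadabra", "kiwi"]


def has_three_a(line):
    """Return True when the line contains at least three 'a' characters."""
    return line.count("a") >= 3


# In Python 3, filter returns a lazy iterator, so list() is used to get a list back.
matches = list(filter(has_three_a, sample_lines))
print(matches)
```

The function given to filter takes one line and returns True or False; filter keeps only the lines for which it returns True.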
You can choose any transformation of the string data you want. You might choose one of the following for map:
Make a list of all the strings with all the letters capitalized
Make a list of all the strings with all the a’s replaced with b’s
Make a list of all the numeric strings converted to numbers
Get creative! You can browse the Python string methods to see what some options are.
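One possible transformation, sketched under assumptions: converting numeric strings to numbers with map. Because int() raises a ValueError on text such as "seven", this sketch first filters down to lines made up only of digits. The names is_whole_number and to_number, and the sample data, are hypothetical.

```python
# Hypothetical sample lines mixing numeric and non-numeric text.
sample_lines = ["12", "seven", "300", "4 apples"]


def is_whole_number(line):
    """Return True when every character in the line is a digit."""
    return line.isdigit()


def to_number(line):
    """Convert a string of digits to an int."""
    return int(line)


# Keep only the lines that int() can handle, then convert each one.
numeric_lines = list(filter(is_whole_number, sample_lines))
numbers = list(map(to_number, numeric_lines))
print(numeric_lines)
print(numbers)
```

Note that the result of map here is a list of ints rather than strings, which is why transform_lines is allowed to return a list of a datatype of your choice.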
You must write (at least) four function definitions:
A function you will pass to filter, which can have any name you choose.
search_lines, which takes a list of strings and returns a list of strings. It must call filter.
A function you will pass to map, which can have any name you choose.
transform_lines, which takes a list of strings and returns a list of a datatype of your choice. It must call map.
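The four definitions might be laid out roughly as follows. This is only a sketch of the structure: the helper names starts_with_b and shout, their behavior, and the sample list are placeholders, and your own search and transformation should be more interesting.

```python
def starts_with_b(line):
    """Placeholder search criterion: does the line start with 'b'?"""
    return line.startswith("b")


def search_lines(lines):
    """Take a list of strings and return the ones that match the criterion."""
    return list(filter(starts_with_b, lines))


def shout(line):
    """Placeholder transformation: capitalize every letter in the line."""
    return line.upper()


def transform_lines(lines):
    """Take a list of strings and return a list of transformed values."""
    return list(map(shout, lines))


# Hypothetical sample data standing in for the lines read from a file.
sample_lines = ["banana", "apple", "blueberry", "cherry"]
found = search_lines(sample_lines)
print(len(sample_lines), len(found))  # compare list sizes before and after filtering
print(transform_lines(sample_lines))
```

Wrapping filter and map in list() means both functions return actual lists, so len() and printing work on the results directly.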
Writeup
Finally, please explain what you’ve done in writeup.txt. We’ve provided a template with questions for you; in particular, we’re interested in knowing:
Are there any bugs or issues in your code? (Please put this in the final part, “Bugs/Issues”.)
Please demonstrate your interactions.
For search_lines, call search_lines on a large sample (at least 100 items, like words100.txt). Compare the length of the filtered list to the length of the original list using len on both. Please do not copy/paste a long output!
For transform_lines, call transform_lines on a small sample (no more than 20 items, like words5.txt) and copy/paste the (short!) output. Make sure that you’ve chosen a sample file that highlights the interesting behavior in your function.