Text Processing with Ruby
Extract Value from the Data That Surrounds You
This is a fun, readable, and very useful book. I’d recommend it to anyone who
needs to deal with text—which is probably everyone.
➤ Paul Battley
Developer, maintainer of text gem
I’d recommend this book to anyone who wants to get started with text processing.
Ruby has powerful tools and libraries for the whole ETL workflow, and this book
describes everything you need to get started and succeed in learning.
➤ Hajba Gábor László
Developer
A lot of people get into Ruby via Rails. This book is really well suited to anyone
who knows Rails, but wants to know more Ruby.
➤ Drew Neil
Director, Studio Nelstrom, and author of Practical Vim
Rob Miller
Acknowledgments . . . . . . . . . . . ix
Introduction . . . . . . . . . . . . . xi
3. Shell One-Liners . . . . . . . . . . . 29
Arguments to the Ruby Interpreter 30
Prepending and Appending Code 35
Example: Parsing Log Files 37
Wrapping Up 39
5. Delimited Data . . . . . . . . . . . . 51
Parsing a TSV 52
Delimited Data and the Command Line 56
The CSV Format 58
Wrapping Up 62
6. Scraping HTML . . . . . . . . . . . . 63
The Right Tool for the Job: Nokogiri 63
Searching the Document 64
Working with Elements 72
Exploring a Page 77
Example: Reading a League Table 80
Wrapping Up 88
7. Encodings . . . . . . . . . . . . . 89
A Brief Introduction to Character Encodings 90
Ruby’s Support for Character Encodings 92
Detecting Encodings 98
Wrapping Up 99
Part IV — Appendices
A1. A Shell Primer . . . . . . . . . . . . 229
Running Commands 229
Controlling Output 230
Exit Statuses and Flow Control 232
Many thanks to Alessandro Bahgat, Paul Battley, Jacob Chae, Peter Cooper,
Iris Faraway, Kevin Gisi, Derek Graham, James Edward Gray II, Avdi Grimm,
Hajba Gábor László, Jeremy Hinegardner, Kerri Miller, and Drew Neil for their
helpful technical review comments, questions, and suggestions—all of which
shaped this book for the better.
Thanks to Rob Griffiths, Mark Rogerson, Samuel Ryzycki, David Webb, Lewis
Wilkinson, Alex Windett, and Mike Wright for ensuring there was no chance
I got too big for my football boots.
Unlike binary formats, text has the pleasing quality of being readable by
humans as well as computers, making it easy to debug and requiring no
distinction between output that’s for human consumption and output that’s
to be used as the input for another step in a process.
The second concern is with actually processing the text once we’ve got it into
the program. This usually means either extracting data from within the text,
parsing it into a Ruby data structure, or transforming it into another format.
The most important subject in this second stage is, without a doubt, regular
expressions. We’ll look at regular expression syntax, how Ruby uses regular
expressions in particular, and, importantly, when not to use them and instead
reach for solutions such as parsers.
We’ll also look at the subject of natural language processing in this part of
the book, and how we can use tools from computational linguistics to make
our programs smarter and to process data that we otherwise couldn’t.
The final step is outputting the transformed text or the extracted data some-
where—to a file, to a network service, or just to the screen. Part of this process
is concerned with the actual writing process, and part of it is concerned with
the form of the written data. We’ll look at both of these aspects in the third
part of the book.
Together, these three steps are often described as “extract, transform, and
load” (ETL). It’s a term especially popular with the “big data” folks. Many text
processing tasks, even ones that seem on the surface to be very different from
one another, fall into this pattern of three steps, so I’ve tried to mirror that
structure in the book.
In general, we’re going to explore why Ruby is an excellent tool to reach for
when working with text. I also hope to persuade you that you might reach
for Ruby sooner than you think—not necessarily just for more complex tasks,
but also for quick one-liners.
Most of all, I hope this book offers you some useful techniques that help you
in your day-to-day programming tasks. Where possible, I’ve erred toward the
practical rather than the theoretical: if it does anything, I’d like this book to
point you in the direction of practical solutions to real-world problems. If your
day job is anything like mine, you probably find yourself trawling through
text files, CSVs, and command-line output more often than you might like.
Helping to make that process quick and—dare I say it?—fun would be fantas-
tic.
While the book starts with material likely to be familiar to anyone who’s
written a command-line application in Ruby, there’s still something here for
the more advanced user. Even people who’ve worked with Ruby a lot aren’t
necessarily aware of the material covered in Chapter 3, Shell One-Liners, on
page 29, for example, and I see far too many developers reaching for regular
expressions to parse HTML rather than using the techniques outlined in
Chapter 6, Scraping HTML, on page 63.
Even experienced developers might not have written parsers before (covered
in Chapter 10, Writing Parsers, on page 127), or dabbled in natural language
processing (as we do in Chapter 11, Natural Language Processing, on page
155)—so hopefully those subjects will be interesting regardless of your level of
experience.
I’ve tried to include in each of the chapters material of interest even to more
advanced Rubyists, so there aren’t any chapters that are obvious candidates
to skip if you’re at that end of the skill spectrum.
If you’re not familiar with how to use the command line, there’s a beginner’s
tutorial in Appendix 1, A Shell Primer, on page 229, and a guide to various
commands in Appendix 2, Useful Shell Commands, on page 235. These
appendixes will give you more than enough command-line knowledge to follow
all of the examples in the book.
Online Resources
The page for this book on the Pragmatic Bookshelf website (https://pragprog.com/book/rmtpruby/text-processing-with-ruby) contains a discussion forum, where you can post any comments or questions you might have about the book and make suggestions for any changes or expansions you'd like to see in future editions. If you discover any errors in the book, you can submit them there, too.
Rob Miller
August 2015
The first part of our text processing journey is concerned with getting text into our
program. This text might reside in files, might be entered by the user, or might come
from other processes; wherever it comes from, we’ll learn how to read it.
We’ll also look at taking structure from the text that we read, learning how to parse
CSV files and even scrape information from web pages.
Throughout the course of this chapter, we’ll look at how we can use Ruby to
reach text that resides in files. We’ll look at the basics you might expect, with
some methods to straightforwardly read files in one go. We’ll then look at a
technique that will allow us to read even the biggest files in a memory-efficient
way, by treating files as streams, and look at how this can give us random
access into even the largest files. Let’s take a look.
Opening a File
Before we can do something with a file, we need to open it. This signals our
intent to read from or write to the file, allowing Ruby to do the low-level work
that makes that intention actually happen on the filesystem. Once it's done those
things, Ruby gives us a File object that we can use to manipulate the file.
Once we have this File object, we can do all sorts of things with it: read from
the file, write to it, inspect its permissions, find out its path on the filesystem,
check when it was last modified, and much more.
To open a file in Ruby, we use the open method of the File class, telling it the
path to the file we’d like to open. We pass a block to the open method, in which
we can do whatever we like with the file. Here’s an example:
File.open("file.txt") do |file|
# ...
end
Because we passed a block to open, Ruby will automatically close the file for
us after the block finishes, freeing us from doing that cleanup work ourselves.
The argument that open passes to our block, which in this example I’ve called
file, is a File object that points to the file we’ve requested access to (in this case,
file.txt). Unless we tell Ruby otherwise, it will open files in read-only mode, so
we can’t write to them accidentally—a safe default.
Kernel#open
In the real world, it’s common to see people using the global open method rather than
explicitly using File.open:
open("file.txt") do |file|
# ...
end
As well as being shorter, which is always nice, this convenient method is actually a
wrapper for a number of different types of IO objects, not just files. You can use it to
open URLs, other processes, and more. We’ll cover some more uses of open later; for
now, use either File.open or regular open as you prefer.
There’s nothing in our block yet, so this code isn’t very useful; it doesn’t
actually do anything with the file once it’s opened. Let’s take a look at how
we can read content from the file.
We can achieve this by using the read method on our File object:
File.open("file.txt") do |file|
contents = file.read
end
The read method returns for us a string containing the file’s contents, no
matter how large they might be.
Alternatively, if all we’re doing is reading the file and we have no further use
for the File object once we’ve done so, Ruby offers us a shortcut. There’s a read
method on the File class itself, and if we pass it the name of a file, then it will
open the file, read it, and close it for us, returning the contents:
contents = File.read("file.txt")
Whichever method we use, the result is that we have the entire contents of
the file stored in a string. This is useful if we want to blindly pass those con-
tents over to something else for processing—to a Markdown parser, for
example, or to insert it into a database, or to parse it as JSON. These are all
very common things to want to do, so read is a widely used method.
For example, if our file contained some JSON data, we could parse it using
Ruby’s built-in JSON library:
require "json"
json = File.read("file.json")
data = JSON.parse(json)
Line-by-line Processing
Lots of plain-text formats—log files, for instance—use the lines of a file as a
way of structuring the content within them. In files like this, each line repre-
sents a distinct item or record. It’s about the simplest way to separate data,
but this kind of structure is more than enough for many use cases, so it’s
something you’ll run into frequently when processing text.
One example of this sort of log file that you might have encountered before
is from the popular web server Apache. For each request made to it, Apache
will log some information: things like the IP address the request came from,
the date and time that the request was made, the URL that was requested,
and so on. The end result looks like this:
127.0.0.1 - [10/Oct/2014:13:55:36] "GET / HTTP/1.1" 200 561
127.0.0.1 - [10/Oct/2014:13:55:36] "GET /images/logo.png HTTP/1.1" 200 23260
192.168.0.42 - [10/Oct/2014:14:10:21] "GET / HTTP/1.1" 200 561
192.168.0.91 - [10/Oct/2014:14:20:51] "GET /person.jpg HTTP/1.1" 200 46780
192.168.0.42 - [10/Oct/2014:14:20:54] "GET /about.html HTTP/1.1" 200 483
Let’s imagine we wanted to process this log file so that we could see all the
requests made by a certain IP address. Because each line in the file represents
one request, we need some way to loop over the lines in the file and check
whether each one matches our conditions—that is, whether the IP address
at the start of the line is the one we’re interested in.
One way to do this would be to use the readlines method on our File object. This
method reads the file in its entirety, breaking the content up into individual
lines and returning an array:
File.open("access_log") do |log_file|
requests = log_file.readlines
end
At this point, we’ve got an array—requests—that contains every line in the file.
The next step is to loop over those lines and only output the ones that match
our conditions:
File.open("access_log") do |log_file|
requests = log_file.readlines
requests.each do |request|
if request.start_with?("127.0.0.1 ")
puts request
end
end
end
Using each, we loop over each request. We then ask the request if it starts
with 127.0.0.1, and if the response is true, we output it. Lines that don’t start
with 127.0.0.1 will simply be ignored.
While this solution works, it has a problem. Because it reads the whole file
at once, it consumes an amount of memory at least equal to the size of the
file. This will hold up okay for small files, but as our log file grows, so will the
memory consumed by our script.
If you think about it, though, we don’t actually need to have the whole file in
memory to solve our problem. We’re only ever dealing with one line of the file
at any given moment, so we only really need to have that particular line in
memory. For some problems it’s necessary to read the whole file at once, but
this isn't one of them. Let's look at how we can rework this example so that
we only read one line at a time.
The solution is to treat the file as a stream. Instead of reading from the
beginning of the file to the end in one go, and keeping all of that information
in memory, we read only a small amount at a time. We might read the first
line, for example, then discard it and move onto the second, then discard that
and move onto the third, and so on until we reach the end of the file. Or we
might instead read it character by character, or word by word. The important
thing is that at no point do we have the full file in memory: we only ever store
the little bit that we’re processing.
The File object yielded to our block has a method called each_line. This method
accepts a block and will step through the file one line at a time, executing
that block once for each line.
File.open("access_log") do |log_file|
log_file.each_line do |request|
if request.start_with?("127.0.0.1 ")
puts request
end
end
end
That’s it. The each_line method allows us to step through each line in the file
without ever having more than one line of the file in memory at a time. This
method will consume the same amount of memory no matter how large the
file is, unlike our first solution.
Just like with File.read, Ruby offers us a shortcut that doesn’t require us to
open the file ourselves: File.foreach. Using it trims the previous example down
a little:
File.foreach("access_log") do |request|
  if request.start_with?("127.0.0.1 ")
    puts request
  end
end
Enumerable Streams
The each_line method of the File class is aliased to each. This might not seem
particularly remarkable, but it’s actually tremendously useful. This is because
Ruby has a module called Enumerable that defines methods like map, find_all,
count, reduce, and many more. The purpose of Enumerable is to make it easy to
search within, add to, delete from, iterate over, and otherwise manipulate
collections. (You’ve probably used methods like these when working with
arrays, for example.)
Well, a file is a collection too. By default, Ruby considers the elements of that
collection to be the lines within the file, so because File includes the Enumerable
module, we can use all of its methods on those lines. This can make many
processing operations simple and expressive, and because many of Enumerable’s
methods don’t require us to consume the whole file—they’re lazy, in other
words—we often retain the performance benefits of streaming, too.
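For a first taste of what this looks like, here's a sketch that counts how many requests in our log came from a particular IP address, without ever holding the whole file in memory (the filename matches the earlier examples):

matches =
  File.open("access_log") do |file|
    # count comes from Enumerable; File enumerates its lines
    file.count { |line| line.start_with?("127.0.0.1 ") }
  end

puts matches

Because the file is enumerated line by line, count streams through it just as each_line does.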
To explore what this means, we can revisit our log example. Let’s imagine you
wanted to group all of the requests made by each IP address, and within that
group them by the URL requested. In other words, you want to end up with
a data structure that looks something like this:
{
  "127.0.0.1" => [
    "/",
    "/images/logo.png"
  ],
  "192.168.0.42" => [
    "/",
    "/about.html"
  ],
  "192.168.0.91" => [
    "/person.jpg"
  ]
}
Here’s a script that uses the methods offered by Enumerable to achieve this:
requests-by-ip.rb
requests =
  File.open("data/access_log") do |file|
    file
      .map { |line| { ip: line.split[0], url: line.split[4] } }
      .group_by { |request| request[:ip] }
      .each { |ip, requests| requests.map! { |r| r[:url] } }
  end
We open the file just like we did previously. But instead of using each_line to
iterate over the lines of the file, we use map. This loops over the lines of the
file, building up an array as it does so by taking the return value of the block
we pass to it. Here our block is using split to separate the lines on whitespace.
The first of these whitespace-separated fields contains the IP, and the fifth
contains the URL that was requested, so the block returns a hash. The result
of our map operation is therefore an array of hashes that contain only the
information about the request that we’re interested in—the IP address and
the URL.
Next, we use the group_by method. This transforms our array of hashes into a
single hash. It does so by checking the return value of the block that we pass
to it; all the elements of the array that return the same value will be grouped
together. In this case, our block returns the IP part of the request, which
means that all of the requests made by the same IP address will be grouped
together.
The data structure after the group_by operation looks something like this:
{
  "127.0.0.1" => [
    {:ip=>"127.0.0.1", :url=>"/"},
    {:ip=>"127.0.0.1", :url=>"/images/logo.png"}
  ],
  "192.168.0.42" => [
    {:ip=>"192.168.0.42", :url=>"/"},
    {:ip=>"192.168.0.42", :url=>"/about.html"}
  ],
  "192.168.0.91" => [
    {:ip=>"192.168.0.91", :url=>"/person.jpg"}
  ]
}
This is almost what we were after. The problem is that we have both the IP
address and the URL of the request, rather than just the URL. So the next
step in our chain uses each to loop over these IP address and request combi-
nations. It then uses map! to replace the array of hashes with just the URL
portion, leaving us with an array of strings.
By default, each behaves exactly like each_line, looping over the lines in the file.
But it also accepts an argument allowing you to change the character on
which it will split, from a newline to anything else you might like.
Let’s imagine we had a file with only a single line in it, but that contained
many different records separated by commas:
this is field one,this is field two,this is field three
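To process each of those records in turn, we can pass the separator to each. Here's a minimal sketch (the filename is illustrative); note that each chunk arrives with the trailing separator attached, so we trim it with chomp:

File.open("fields.txt") do |file|
  # step through the file one comma-separated chunk at a time
  file.each(",") do |field|
    puts field.chomp(",")
  end
end

File can go finer-grained still: its each_char method steps through the file one character at a time, in the same streaming fashion.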
Again, this method has all of the benefits of other streaming examples; we
only ever have a single character in memory at one time, so we can process
even the largest of files.
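Suppose, then, that we wanted to count how many times the letter b appears in a file. A first attempt might initialize a counter variable and increment it by hand as we step through each character (a deliberately verbose sketch):

n = 0

File.open("file.txt") do |file|
  file.each_char do |char|
    # bump the counter every time we see a "b"
    n += 1 if char == "b"
  end
end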
For example, we could rewrite the previous example, where we quite verbosely
initialized our n variable and incremented it manually, by using Enumerable’s
count method:
character-count.rb
n =
  File.open("file.txt") do |file|
    file.each_char.count { |char| char == "b" }
  end
The count method accepts a block and will return the number of values for
which the block returned true. This is exactly what our previous code was
doing, but this way is a little shorter and a little neater, and reveals our
intentions more clearly.
A File object, incidentally, is a kind of IO object; File is a subclass of Ruby's IO class. This might seem like an academic distinction, but it has an important benefit: it
means that other types of IO in Ruby have those same methods, too. Files, streams,
sockets, Unix pipelines—all of these things are fundamentally similar, and it’s in IO
that these similarities are gathered into one abstraction. In the words of Ruby’s own
documentation, IO is “the basis for all input and output in Ruby.” By learning to read
from files, then, you’ll learn both principles and concrete methods that will translate
to all the other places from which you might want to acquire text.
If you know how to write output to the screen, then—using puts—you already know
how to write to a file: by calling puts on the file object. Our screen and a file are both
IO objects—of two different kinds—so the way we interact with them is the same.
This similarity will be very useful throughout our text processing journey.
Let's put these techniques to work on some real data. Imagine we've downloaded
a file of historical sea surface temperature data from NOAA, recorded weekly
for four regions of the Pacific and stored in a fixed-width format, and we want
to dig into it. We might want to find out what the warmest week was, or plot
the results on a graph, or just show what the temperature of a particular
region was last week. To do any of these things, we need to parse the data
and get it into our script.
First, a quick explanation of the data. The first column contains the date of
the week in which the measurements were taken. The other four columns
represent different regions of the ocean. For each of them we have two num-
bers: the first representing the recorded temperature, and the second repre-
senting the departure from the expected temperature that this recording
represents (the “sea surface temperature anomaly”). In the first row, then,
the first region recorded a temperature in the week of 3 January 1990 of 23.4
degrees, which is an anomaly of -0.4 degrees.
The pleasing visual quality that this data has—the fact that all the columns
in the table line up neatly—will help us in this task. If we were to count the
characters across each line, we’d see that each field started at exactly the
same place in each row. The first column, containing the date of the week in
question, is always ten characters long. The next number is nine characters
long, always, and the following number is always four characters, regardless
of whether it has a negative sign. This nine/four pattern repeats three more
times for the other three regions.
In trying to get this data into our script, let’s look at how to read the first row
of data.
Previously, we used read in its basic form, without any arguments, which read
the entire file into memory. But if we pass an integer as the first argument,
read will read only that many bytes forward from the current position in the
file.
So, from the start of the file, we could read in each field in the first row as
follows:
noaa-first-row-simple.rb
File.open("data/wksst8110.for") do |file|
puts file.read(10)
4.times do
puts file.read(9)
puts file.read(4)
end
end
# >> 03JAN1990
# >> 23.4
# >> -0.4
# >> 25.1
# >> -0.3
# >> 26.6
# >> 0.0
# >> 28.6
# >> 0.3
We first read ten bytes, to get the name of the week. Then we read nine bytes
followed by four bytes to extract the numbers, doing this four times so that
we extract all four regions.
From here, it's not much work to have our script continue through the rest
of the file, slurping up all of the data within and converting it into a Ruby
data structure, in this case an array with one hash per week:
noaa-all-rows.rb
File.open("data/wksst8110.for") do |file|
weeks = []
until file.eof?
week = {
date: file.read(10).strip,
temps: {}
}
file.read(1)
weeks
# => [{:date=>"03JAN1990",
# :temps=>
# {:nino12=>{:temp=>23.4, :change=>-0.4},
# :nino3=>{:temp=>25.1, :change=>-0.3},
# :nino34=>{:temp=>26.6, :change=>0.0},
# :nino4=>{:temp=>28.6, :change=>0.3}}},
# {:date=>"10JAN1990",
# :temps=>
# {:nino12=>{:temp=>23.4, :change=>-0.8},
# :nino3=>{:temp=>25.2, :change=>-0.3},
# :nino34=>{:temp=>26.6, :change=>0.1},
# :nino4=>{:temp=>28.6, :change=>0.3}}},
# {:date=>"17JAN1990",
# :temps=>
# {:nino12=>{:temp=>24.2, :change=>-0.3},
# :nino3=>{:temp=>25.3, :change=>-0.3},
# :nino34=>{:temp=>26.5, :change=>-0.1},
# :nino4=>{:temp=>28.6, :change=>0.3}}},
# ...snip...
end
The logic is fundamentally the same as when reading the first row. To loop
over all the rows in the file, there are two main changes: first, we loop until we
hit the end of the file by checking file.eof?; it will return true when the end of
the file is reached and therefore end our loop. The other addition is the call
to file.read(1) at the end of the row; this will consume the newline character at
the end of each line. We’re also using strip to strip the whitespace from the
week name, and to_f to convert the temperature numbers to floats.
This method works and is fast. But by only using read to consume a fixed
number of bytes, we haven't seen the most important advantage of treating
the file in this way: the fact that it offers us random access to the records
within the file.
This might seem an inflexible and impractical way of doing things. After all,
how can we know at how many bytes from the start of the file we'll find the
record we're looking for? Well, because each of the columns within the data
has a fixed width, that means that all of the rows do, too. Adding up the
columns, including the newline at the end, gives us 10 + 4 * (9 + 4) + 1 = 63
characters, so we know that each of our records is 63 bytes long.
If we used seek to skip 63 bytes into the file, then our first call to read would
begin reading from the second record:
noaa-skip-first-row.rb
File.open("data/wksst8110.for") do |file|
file.seek(63)
file.read(10)
# => " 10JAN1990"
end
As we can see, our first call to read returns for us the date of the second week
in the file, not the first. Using this method, we can now skip to arbitrary
records—the first, the tenth, the thousandth, whatever we like—and read
their data.
The most important part of this is that seeking happens in constant time.
That means that it takes the same amount of time no matter how large the
file is and no matter how far into the file we want to seek. We’ve finally
uncovered the amazing benefit to fixed-width files like this—that we gain the
ability to access records within them at random, so it’s no slower to find the
303rd record than it is to find the third—or even the 300,003rd.
In the final version of our script, then, we can write a get_week method that
will retrieve a record for us given an index for that record (1 for the first, 2 for
the second, and so on):
noaa-seek.rb
def get_week(file, week)
  file.seek((week - 1) * 63)
  week = { date: file.read(10).strip, temps: {} }
  [:nino12, :nino3, :nino34, :nino4].each do |region|
    week[:temps][region] = { temp: file.read(9).to_f, change: file.read(4).to_f }
  end
  week
end
File.open("data/wksst8110.for") do |file|
get_week(file, 3)
# => {:date=>"17JAN1990",
# :temps=>
# {:nino12=>{:temp=>24.2, :change=>-0.3},
# :nino3=>{:temp=>25.3, :change=>-0.3},
# :nino34=>{:temp=>26.5, :change=>-0.1},
# :nino4=>{:temp=>28.6, :change=>0.3}}}
get_week(file, 303)
# => {:date=>"18OCT1995",
# :temps=>
# {:nino12=>{:temp=>20.0, :change=>-0.8},
# :nino3=>{:temp=>24.1, :change=>-0.9},
# :nino34=>{:temp=>25.8, :change=>-0.9},
# :nino4=>{:temp=>28.2, :change=>-0.5}}}
get_week(file, 1303)
# => {:date=>"17DEC2014",
# :temps=>
# {:nino12=>{:temp=>22.9, :change=>0.1},
# :nino3=>{:temp=>26.0, :change=>0.8},
# :nino34=>{:temp=>27.4, :change=>0.8},
# :nino4=>{:temp=>29.4, :change=>1.0}}}
end
Here we use the get_week method to fetch the third, 303rd, and 1,303rd records.
With this method we can treat the data within the file almost as though it
was a data structure within our script—like an array—even though we haven’t
had to read any of it in. This allows us to randomly access data within even
the largest of files in a very fast and efficient way.
One important caveat is that read and seek operate at the level of bytes, not
characters. You’ll learn more about the difference between the two in Chapter
7, Encodings, on page 89, but it’s worth noting that if you’re using a multibyte
character encoding, like UTF-8, then using seek carelessly might leave you in
the middle of a multibyte character and might mean that you get some gib-
berish when you try to read data.
You should therefore use these methods only when you know that you’re
dealing solely with single-byte characters or when you know that the location
you’re seeking to will never be in the middle of a character—as in our temper-
ature data example, where we’re seeking to the boundaries between records.
Despite this limitation of seek, hopefully you can see the benefit of using a
fixed-width file like this. We can retrieve any value, no matter how big the file
is, without reading any unnecessary data; we have what’s called random
access to the data within. To retrieve the tenth record, we just need to seek
567 bytes from the start of the file; to retrieve the 703rd, we just need to seek
44,226 bytes from the start; and so on. The wonderful thing is that no matter
how large our file gets, this operation will always take the exact same amount
of time—even if we’ve got hundreds of megabytes of data. That’s why it’s
sometimes worth putting up with the limitations of such a format: it’s both
very simple and very fast.
Wrapping Up
That’s about it for reading files. We looked at how to open a file and what we
can do with the resulting File object. We covered reading files in one go and
processing them like streams, and why you’d prefer one or the other. We
explored how we can use the methods offered by Enumerable to transform and
manipulate the content of files. We looked at line-by-line processing and
reading arbitrary numbers of bytes, and how we can seek to arbitrary locations
in the file to replicate some of the functionality of a database.
With these techniques, we’ve gained an impressive arsenal for reading text
from files large and small. Next, we’ll take our newfound knowledge of streams
and apply it to another source of text: standard input.
Programs can receive text not just from files but also from the keyboard and from other programs. This source of input is called standard input, and it's one of the foundations
of text processing. Along with its output equivalents standard output and
standard error, it enables different programs to communicate with one
another in a way that doesn’t rely on complex protocols or unintelligible
binary data, but instead on straightforward, human-readable text.
Learning how to process standard input will allow you to write flexible and
powerful utilities for processing text, primarily by enabling you to write pro-
grams that form part of text processing pipelines. These chains of programs,
linked together so that each one’s output flows into the input of the next, are
incredibly powerful. Mastering them will allow you to make the most both of
your own utilities and of those that already exist, giving you the most text
processing ability for the least amount of typing possible.
Let’s take a look at how we can write scripts that process text from standard
input, slotting into pipeline chains and giving us flexibility, reusability, and
power.
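The simplest use of standard input is reading from the keyboard. Here's a sketch that asks the user for their name (the prompt and variable name are just for illustration):

print "What's your name? "
name = $stdin.gets.chomp
puts "Hello, #{name}!"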
Here we ask standard input—$stdin—for a line of input using the gets method,
using chomp to remove the trailing newline. This gives us a string, which we
store in name.
This simplistic use of standard input isn’t particularly useful, let’s face it.
But it’s actually only half of the story. Standard input isn’t just used to read
from the keyboard interactively; it can also read from input that’s been redi-
rected—or piped—to your script from another process.
The ultimate goal here is to be able to use your scripts in pipeline chains.
These are chains of programs strung together so that the output from the
first is fed into the input of the second, the output of the second becomes the
input of the third, and so on. Here's an example that prints the process IDs
of all running Ruby processes. In it, ps ax lists the running processes, grep
keeps only the lines that mention ruby, and cut extracts the first space-separated
field of each of those lines:

$ ps ax | grep ruby | cut -d' ' -f1
That example used preexisting commands to do its work. But we can write
our own programs that slot into such workflows. Imagine that we frequently
wanted to convert sections of text to uppercase. We know how to convert text
to uppercase in Ruby, so we could write a script that works like this:
$ echo "hello world" | ruby to-uppercase.rb
HELLO WORLD
In other words, we could write a program that converts any text it receives
on standard input to uppercase, then outputs that converted text. It won’t
know where the text is coming from (for example, the echo command we saw
previously versus the hostname command)—it accepts anything you pass to it.
This gives you great flexibility in how you use the script, opening up ways of
using it that you might not have foreseen when writing it.
This flexibility is what makes such scripts useful. Your goal, or at least a
pleasant side effect of processing text in this way, is to build up a library of
such scripts so that, if you encounter the same problem again, you can just
slot the script you wrote last time into the new pipeline chain and be on your
way. The to-uppercase.rb script is a good example of this: you might need to write
it from scratch the first time you encounter the problem of converting input
to uppercase, but after that it can be used again and again in completely
different situations.
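A minimal version of the script might look like this, reading each line from standard input, upcasing it, and writing it back out:

$stdin.each_line do |line|
  # upcase the line; puts sends it to standard output
  puts line.upcase
end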
Saving this script as to-uppercase.rb, we’ve got everything we need. We can run
it like this:
$ echo "hello world" | ruby to-uppercase.rb
HELLO WORLD
$ hostname | ruby to-uppercase.rb
ROB.LOCAL
$ whoami | ruby to-uppercase.rb
ROB
We now have a script that reads from standard input, modifies what it receives,
and outputs it to standard output. It’s general purpose. It doesn’t know or
care where its input comes from, but it processes the input happily regardless.
Countless examples of this type of tool already exist, distributed with Unix-
like operating systems: grep, for example, which outputs only lines that match
a given pattern, or sort, which outputs an alphabetically sorted version of its
input. The scripts you write yourself will be right at home with these standard
Unix utilities as part of your text processing pipelines.
It was also annoying in the previous example that we had to type ruby to-
uppercase.rb. Other commands are short and snappy—cut, grep—but we had to
type what feels like a lot of superfluous information.
For our next example, we’re going to write a script that extracts URLs from
the input passed to it, outputting any that it finds and ignoring the rest of
the input. So, if we passed it the following text:
Alice's website is at http://www.example.com
While Jane's website is at https://example.net and contains a useful blog.
then it would output just the URLs it found:

http://www.example.com
https://example.net

This script will be called urls, and once we've written it we'll be able to use it
in any pipeline we like. Because it will treat its input as a stream, we’ll be
able to use it on whatever input we like, no matter how large it is. So we’ll be
able to extract the URLs from a text file:
$ cat file.txt | urls
The Shebang
Up until now we’ve only run our Ruby scripts by telling the Ruby interpreter
the name of the file to execute. But when we’re using ordinary Unix commands,
such as grep or uniq, we just specify them as commands in their own right.
Ideally, we want to be able to do the same with our URL extractor. It would
be annoying if we had to type ruby urls.rb or something similar each time we
wanted to use it, especially if we’re going to be using it a lot.
But if we just called our script urls, how would our shell know that it was a
Ruby script and know to pass its contents to Ruby to execute? The answer
is, because we tell it to, and we tell it using a special line at the top of our
script called the shebang. In this case, we’d use:
#!/usr/bin/env ruby
The special part is the #!—it’s this that gives the line its name (“hash” +
“bang”). Since the Ruby interpreter might be in different places on different
people’s computers, we use a command called env to tell the shell to use ruby,
wherever ruby might be.
The presence of this shebang allows us to save our script as a file called urls
and run it directly, rather than as ruby urls. The final step in this process is to
allow the file to be executed. We can do this with the chmod command:
$ chmod +x urls
That’s it. We can now call ./urls from within the directory our urls file resides
in, and it will execute our script as Ruby code.
If we wanted to be able to call our version from anywhere, not just from the
directory in which it’s saved, we could put it into a directory that’s within our
PATH—/usr/local/bin, for example. Many people create a directory under their
home directory—typically called bin—and put that into their path, so that they
have a place to keep all of their own scripts without losing them among the
ones that came with their system or that they’ve installed from elsewhere.
Putting the script in a directory that’s in your PATH will make it feel just like
any other text processing command and make it really easy to use wherever
you are. If you think you’ll use a particular script regularly, then don’t hesitate
to put it there. The only thing you need to do is to make sure the name of the
script doesn’t clash with an existing command that you still want to be able
to use—otherwise, you’ll run your script when you type the command, rather
than whatever command originally had that name. So don’t call it ls or mv!
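If you want to go down that route, the setup takes only a couple of commands. Here's a sketch, assuming a Bourne-style shell and the conventional ~/bin directory:

$ mkdir -p ~/bin
$ mv urls ~/bin/
$ echo 'export PATH="$HOME/bin:$PATH"' >> ~/.profile

Open a new shell session, and urls will be available everywhere, just like any other command.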
Just like the File objects we saw in the previous chapter, $stdin has an each_line
method that allows us to iterate over the lines in our input:
$stdin.each_line do |line|
  # ...
end
Processing the input as a stream like this also means that we can pass our output along to the next stage in the process as
and when we process it. If our script is the last stage in the pipeline, that
means the user sees output more quickly; and if we’re earlier in the pipeline,
then it means the next part of the pipeline can be doing its processing while
we’re working on our next chunk.
The Logic
Unlike our to-uppercase.rb example, we’re not actually interested in printing the
line of output, even in a modified form. Instead we want to extract any URLs
we find in it and then output those. To do that, we’ll use a regular expression.
We’ll be covering these in depth in Chapter 8, Regular Expressions Basics,
on page 103, so don’t worry too much about them now:
urls
#!/usr/bin/env ruby

$stdin.each_line do |line|
  urls = line.scan(%r{https?://\S+})
  urls.each do |url|
    puts url
  end
end
Here we use String’s scan method to extract everything that looks like a URL.
Then, we loop over them—after all, there might be multiple URLs in a single
line—and output each one of them.
Of course, we’re not limited to having our script be the final stage in the
pipeline. We could use it as an intermediary step—for example, to fetch a web
page, extract the URLs from it, and then download each of those URLs:
$ curl http://example.com | urls | xargs wget
Hopefully you can imagine many scenarios where having such a script and
other tools like it would come in handy. Before long, if you’re anything like
me, you’ll have built up quite the collection of them, each in true Unix fashion
built to do one thing—but to do it well.
You might imagine that the programs in a pipeline run in sequence, each finishing before the next begins. In reality, though, that's not the case. All of the programs in the pipeline chain
run simultaneously, and data flows between them bit by bit—just like water
through a real pipe. While the second process is working with the first chunk
of information, the first process is generating another chunk; by the time the
first chunk is through to the third or fourth process in the pipeline, the first
process may be onto the third, tenth, or hundredth chunk.
The amazing thing about this concurrency is that the processes themselves
need know nothing about it. It’s all taken care of by the operating system and
the shell, leaving the individual process to worry only about fetching input
and producing output.
We can prove this concurrency by typing the following into our command
line:
$ sleep 5 | echo "hello, world"
hello, world
If the tasks were executed in series, we’d see nothing for five seconds, and
only then would hello, world appear on our screen. But instead, because the
echo command starts at the same time as sleep, we see the output immediately.
When we request more data from standard input—when calling $stdin.gets, for
example—Ruby will do one of two things. If it has input available in its buffer,
it will pass it on immediately. If it doesn’t, though, it will block, waiting until
the process before it in the pipeline has generated enough output.
This can be frustrating when the input you’re receiving is in many small
chunks, especially if those small chunks are slow to generate. One example
is the find command, which searches the filesystem for files matching given
conditions. It might generate hundreds of filenames per second, or it might
generate one per minute, depending on how many files you’re searching
through and how many of them match your conditions.
If we pipe the result of a find into one of our scripts, it may be a long time before the
script actually receives any input, and because this buffering happens at the
output stage, not the input stage, there’s nothing we can do about it. Our
supposedly concurrent pipeline sometimes doesn’t behave concurrently at
all.
While we have no control over the behavior of other programs, if we’re writing
programs ourselves that generate slow output like find does, then we can
remove this buffering by telling our standard output stream to behave syn-
chronously. To illustrate the change, here’s a script that uses the default
behavior and therefore has its output buffered:
stdout-async.rb
100.times do
  "hello world".each_char do |c|
    print c
    sleep 0.1
  end
  print "\n"
end
If we run this script and pipe its output into another program, such as cat,
then we'll see the problem: nothing happens for a very, very long time. Because
we're outputting a character only every 0.1 seconds, the script takes around
110 seconds to run, and since the output is buffered, we see nothing at all
until it finishes. Setting $stdout.sync to true removes the buffering:

$stdout.sync = true

100.times do
  "hello world".each_char do |c|
    print c
    sleep 0.1
  end
  print "\n"
end
Here we set $stdout.sync to true, telling our standard output stream not to buffer
but instead to flush constantly. If we pipe the input from this script into cat,
we’ll see a character appear every 0.1 second. Although the script will take
the same amount of time in total to execute, the next program in the pipeline
will have the chance to work with the output immediately, potentially speeding
up the overall time the pipeline takes.
Wrapping Up
We've now looked at how to use standard input to obtain input from users'
keyboards, how to redirect the output of other programs into our own, and
how powerful text processing pipelines can be. We saw the value of small
tools that perform a single task and how they can be composed together in
different ways to perform complex text processing tasks. We learned how to
write scripts that can be directly executed and that can process standard
input as a stream and so can work with large quantities of input.
Shell One-Liners
We’ve looked at processing text in Ruby scripts, but there exists a stage of
text processing in which writing full-blown scripts isn’t the correct approach.
It might be because the problem you’re trying to solve is temporary, where
you don’t want the solution hanging around. It might be that the problem is
particularly lightweight or simple, unworthy of being committed to a file. Or
it might be that you’re in the early stages of formulating a solution and are
just trying to explore things for now.
In cases like these, it's natural to work directly in the shell, building up a
processing pipeline interactively. Such pipelines will inevitably make use of
standard Unix utilities, such as cat, grep, cut, and so on. In fact, those utilities might actually be suffi-
cient—tasks like these are, after all, what they’re designed for. But it’s common
to encounter problems that get just a little too complex for them, or that for
some reason aren’t well suited to the way they work. At times like these, it
would be nice if we could introduce Ruby into this workflow, allowing us to
perform the more complex parts of the processing in a language that’s familiar
to us.
It turns out that Ruby comes with a whole host of features that make it a
cinch to integrate it into such workflows. First, we need to discover how we
can use it to execute code from the command line. Then we can explore dif-
ferent ways to process input within pipelines and some tricks for avoiding
lengthy boilerplate—something that’s very important when we’re writing
scripts as many times as we run them!
Normally, we run Ruby code by passing the interpreter the name of a script
file:

$ ruby foo.rb

This will execute the code found in foo.rb, but otherwise it won't do anything
too special. If you've ever written Ruby on the command line, you'll definitely
have started Ruby in this way.
What you might not know is that by passing options to the ruby command,
you can alter the behavior of the interpreter. There are three key options that
will make life much easier when writing one-liners in the shell. The first is
essential, freeing you from having to store code in files; the second and third
allow you to skip a lot of boilerplate code when working with input. Let’s take
a look at each of them in turn.
When it comes to using Ruby in the shell, this is hugely limiting. We don’t
want to have to store code in files; we want to be able to compose it on the
command line as we go.
By using the -e flag when invoking Ruby, we can execute code that we pass
in directly on the command line—removing the need to commit our script to
a file on disk. (It might be helpful to remember -e as standing for evaluate,
because Ruby is evaluating the code we pass contained within this option.)
The universal “hello world” example, then, would be as follows:
$ ruby -e 'puts "Hello world"'
Hello world
Any code that we could write in a script file can be passed on the command
line in this way. We could, though it wouldn’t be much fun, define classes
and methods, require libraries, and generally write a full-blown script, but
in all likelihood we’ll limit our code to relatively short snippets that just do a
couple of things. Indeed, this desire to keep things short will lead to making
choices that favor terseness over even readability, which isn’t usually the
choice we make when writing scripts.
This is the first step toward being able to use Ruby in an ad hoc pipeline: it
frees us from having to write our scripts to the filesystem. The second step
is to be able to read from input. After all, if we want our script to be able to
behave as part of a pipeline, as we saw in the previous chapter, then it needs
to be able to read from standard input.
The obvious solution might be to read from STDIN in the code that we pass in
to Ruby, looping over it line by line as we did in the previous chapter:
$ printf "foo\nbar\n" | ruby -e 'STDIN.each { |line| puts line.upcase }'
FOO
BAR
But this is a bit clunky. Considering how often we’ll want to process input
line by line, it would be much nicer if we didn’t have to write this tedious
boilerplate every time. Luckily, we don’t. Ruby offers a shortcut for just this
use case.
That shortcut is the -n flag. When we invoke Ruby with -n, it wraps the code
we pass in an implicit loop that reads standard input one line at a time. This
means that the code we pass in the -e argument is executed once for each line
in our input. The content of the line is stored in the $_ variable. This is one
of Ruby's many global variables, sometimes referred to as cryptic globals, and
it always points to the last line that was read by gets. Combined with -e, that
looks like this:
$ printf "foo\nbar\n" | ruby -ne 'print'
foo
bar
This implicit behavior is particularly useful for filtering down the input to
only those lines that match a certain condition—only those that start with f,
for example:
$ printf "foo\nbar\n" | ruby -ne 'print if $_.start_with? "f"'
foo
This kind of conditional output can be made even more terse with another
shortcut. As well as print, regular expressions also operate implicitly on $_.
We’ll be covering regular expressions in depth in Chapter 8, Regular Expres-
sions Basics, on page 103, but if in the previous example we changed our
start_with? call to use a regular expression instead, it would read:
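$ printf "foo\nbar\n" | ruby -ne 'print if /^f/'
foo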
This one-liner is brief almost to the point of being magical; the subjects of both
the print statement and the if are completely implicit. But one-liners like
this are optimized more for typing speed than for clarity, and so tricks like
this—which have a subtlety that might be frowned upon in more permanent
scripts—are a boon.
There are also shortcut methods for manipulating input. If we invoke Ruby
with either the -n or -p flag, Ruby creates two global methods for us: sub and
gsub. These act just like their ordinary string counterparts, but they operate
on $_ implicitly.
This means we can perform search and replace operations on our lines of
input in a really simple way. For example, to replace all instances of COBOL
with Ruby:
$ echo 'COBOL is the best!' | ruby -ne 'print gsub("COBOL", "Ruby")'
Ruby is the best!
We didn’t need to call $_.gsub, as you might expect, since the gsub method
operates on $_ automatically. This is a really handy shortcut.