2020-05-18

"Programming Algorithms" Book Gets its own Page

Recently, I was busy organizing the process of postal delivery of the paperback version of the book to all the interested people. Here, I have created a map showing everyone who has already ordered:

I'm glad to see the global geography of readership and all the positive feedback. Thanks a lot! Please, share your thoughts online :)

Finally, the book got its own webpage with all the relevant details.

2020-05-08

Dead-Tree Version of "Programming Algorithms"

I have finally obtained the first batch of the printed "Programming Algorithms" books and will shortly be sending them to the 13 people who asked for a hardcopy.

Here is a short video showing the book "in action":

If you also want to get a copy, here's how you do it:

  1. Send an email to [email protected] with your postal address — I'll send you a Paypal money request.
  2. Once I see the donation, I'll go to the post office and send you the book.
  3. Optional step: if you want the book to be signed, please indicate that in your letter.
Shipping details: As I said originally, the price of the dead-tree version will be $20 + shipping. I'll ship via the Ukrainian national post. You can do the fee calculation online here (book weight is 0.58 kg, size is 23 x 17 x 2 cm): https://calc.ukrposhta.ua/international-calculator. Alas, the interface is only in Ukrainian. According to the examples I've tried, the cost will be approximately $10-15. To make it easier, I've just settled on $10 shipping without a tracking number or $15 with a tracking number, regardless of your country. I don't know how long delivery will take; it probably depends on the location (I'll try to inquire when sending).

The book has already been downloaded more than 1170 times (I'm not putting the exact number here as it's constantly growing little by little). I wish I knew how many people have actually read it in full or in part. I've also received some error corrections (special thanks go to Serge Kruk), several small reviews, and letters of encouragement. Those were very valuable, and I hope to see more :)

Greetings from the far away city of Lima, Peru!
I loved this part: "Only losers don’t comment their code, and comments will be used extensively"
Thank you so much for putting this comprehensive collection of highly important data structures, i'm already recommending this to two of my developers, which I hope i'll induce into my Lisp addiction.
--Flavio Egoavil

And here's another one:

Massively impressive book you've written! I've been a Lisp programmer for a long time and truly appreciate the work put in here. Making Lisp accessible for more people in relation to practical algorithms is very hard to do. But you truly made it. You'll definitely end up in the gallery of great and modern Lisp contributions like "Land of Lisp" and "Let Over Lambda". Totally agree with your path to focus on practical algorithmic thinking with Lisp and not messing it up with macros, oop and other advanced concepts.
--Lars Hård

Thanks guys, it's really appreciated!

If you feel the same or you've liked the book in some respect and have found it useful, please, continue to share news about it: that definitely helps attract more readers. And my main goal is to make it as widely read as possible...

2020-04-15

"Programming Algorithms" Book Freely Available

The book "Programming Algorithms (A comprehensive guide to writing efficient programs with examples in Lisp)" has been completed. It turned out to be more than 360 pages in a standard technical book format, with over 100k words (that's 2 NanoWriMos :). It covers more than 100 topics that made it to the TOC — but, actually, more. Phew, making it to this point was quite a challenge...

This book is, surely, not perfect. Hopefully, most of the mistakes in it were fixed with the help of many nice people who commented on the chapters as they were published on my blog.

Also, the book is terribly incomplete. Almost every chapter could be expanded by a factor of two or three with relevant details and concrete implementations of some of the general ideas that are currently only presented in outline. But I didn't have the time to write those down, and, what's much more important, hardly anyone would have had the time to read them in their entirety. I believe I have put in enough concrete examples with executable code to illustrate all the important concepts in each part. This is a great advantage of using Lisp for the book: the code is clear and compact enough both to explain the algorithms and to permit testing them for real, in the REPL. The main compromise each author has to make is between brevity and completeness. I hope that I made the right choices in this regard but, for sure, there's much more to learn about every piece of technology mentioned. My hope is that the book lays a solid groundwork to facilitate further, deeper exploration.

There are also a couple of topics that I would like to cover but couldn't find a good place for. Probabilistic data structures are the most important of them. Yet, they are not big enough to justify a separate chapter and also don't fit into any of the existing ones.

But enough with the whining :) In fact, I'm quite satisfied with the end result as my main goal was to sufficiently develop the following key themes:

  • The main one, obviously, was the description of all the important data structures and the associated algorithms.
  • The next, also very important, was the demonstration of the essential tools that help in the development, testing, and verification of the produced algorithmic code: tracing, profiling, pretty-printing, etc.
  • We have also discussed, when it was relevant, the real-world engineering considerations and constraints that influence the programs using our algorithms. And sometimes these constraints have more impact than the purely theoretical complexity calculations.
  • Finally, in each chapter, I tried to present the practical use case of the algorithms we have studied, showing the broad variety of such applications. In fact, it spans all the different corners of the software landscape we're used to. We have talked, albeit briefly, about such different domains as neural networks, plagiarism detection, web search, mapping, chess-playing, image compression, and many others.

There are a lot of books on algorithms, but I haven't seen any that primarily aim to bridge the gap between theory and practice. This is one of the key distinctions of "Programming Algorithms". It is definitely not the best exposition of the theoretical ideas, but I hope that, instead, it builds sufficient understanding and skill for the common developer to start writing efficient algorithmic programs.

I wanted to finish the book with the following statement: the programming craft is primarily about making choices. What approach to prefer, which algorithm to choose, what tradeoffs to make. And, at another level, which properties to prioritize: speed or safety, brevity or consistency, space or debuggability, clarity or conciseness, and so on and so forth. Lisp is one of the few languages that are "pro-choice". Its authors understood very well the importance of the freedom to make the critical choices, and it is felt in the design of the language. For instance, with the help of declaim we can signal our preferences to the compiler, to some extent, at the level of a single file or even an individual form. (declaim (optimize (speed 3) (safety 1) (debug 0) (compilation-speed 0))) will ask the compiler to produce the fastest possible code. Yes, this language will not guard you against poor choices like some others claim to do. Sometimes, you're not wise enough to make the correct choice, but, much more often, every choice just has its pros and cons, so someone will approve of it and someone won't. And that's what freedom is about: ownership and responsibility. So, use Lisp if you liked it. And if you prefer other languages, I'd urge you to still take advantage of the concept of freedom of choice in programming. Don't be constrained by the prevailing paradigms, and try to use the best parts of all the different approaches you know...
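
To make the two granularities concrete, here is a small illustration (my own sketch, not from the book): a file-level default set with declaim and a per-form override set with a local declare.

;; A file-level default: applies to everything compiled after this form.
(declaim (optimize (speed 3) (safety 1) (debug 0)))

(defun checked-parse (string)
  ;; A per-form override: this one function trades the speed back
  ;; for maximum safety and debuggability.
  (declare (optimize (speed 0) (safety 3) (debug 3)))
  (parse-integer string))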

Acknowledgments

Finally, the most pleasant part. I'm very thankful to those who helped me in the work on "Programming Algorithms" by providing support, advice, corrections, and suggestions. First of all, many thanks to my wife Ksenya who encouraged me to work on it even though the time for it was, in part, taken away from my family duties. Also, I am very grateful to Dr. Robert Strandh who humbly volunteered his help as an editor to make it sound more native (as my English is far from perfect since I'm not a native speaker) and point out the mistakes that I made. He and his wife contributed lots of improvements to more than half of the chapters, and I tried to follow their advice in the subsequent ones. Thanks to Rainer Joswig for commenting on the Lisp choices. Thanks to @dzenicv on reddit who posted links to almost all of the chapters there, which triggered some helpful discussions. Thanks to @tom_mellior on Hacker News for pointing out a serious deficiency in the explanation of Union-Find. Thanks to all those people who shared the links to the chapters and contributed their comments and attention.

If you've found the book helpful and interesting, you can also support its past and (potential) further development in several ways. First and foremost, you can share it with your friends, colleagues, and social networks. The book was made free and will remain free, as its main premise, for me, was to spread the knowledge gathered inside. Yet, you can also make a donation at Leanpub if you believe that it has brought you some value. Last but not least, if you find some way the book can be improved, don't hesitate to contact me.

Finally, I wanted to solicit reviews: if you've read the book and liked it, please, write a paragraph or two to let others know your opinion!

2020-03-30

Programming Algorithms: Synchronization

This is a snippet of the "Synchronization" chapter of the book.

This is the final chapter of the book, in which we will discuss optimization of parallel computations: whether concurrently, on a single machine in a shared-memory setting, or in a distributed, shared-nothing environment. This is a huge topic that spans synchronization itself, parallelization, concurrency, distributed computations, and the functional approach. And every senior software developer should be well-versed in it.

Usually, synchronization is studied in the context of system or distributed programming, but it has a significant algorithmic footprint and is also one of the hottest topics of new algorithm research. In fact, there are whole books that concentrate on it, but, usually, they attack the problem from other angles, not focusing on the algorithmic part. This chapter will be more algorithm-centered, although it will also present an overview of the problem space, so that, in the end, you'll have a good foundation to explore the topic further if the desire or need for that appears.

Let's start from the basics. In the previous chapters of the book, we mainly viewed algorithms as single computations running without external interruptions. This approach, obviously, removes unnecessary complexity, but it also isn't totally faithful to reality. Most of the programs we deal with now run in multiprocessing environments (sometimes even distributed ones), and even when they don't utilize these capabilities, those are still available, they sometimes have their impact, and, besides, they might have improved the performance of the programs had they been utilized. The majority of the backend stuff, which currently consists of services running in datacenters, is multithreaded. There's the notorious "Zawinski's Law" which states that every program attempts to expand until it can read mail, and those programs which cannot so expand are replaced by ones which can. Besides being a good joke, it also reflects an important truth about the tendency of all programs to become network-aware over time, and thus distributed to at least some extent.

There are two principally different types of environments in which the programs that need synchronization run: shared-memory and shared-nothing ones.

In a shared-memory setting, there exists some shared storage (not necessarily RAM) that can be directly accessed by all the threads of the application. Concurrent access to data in this shared memory is the principal source of the synchronization challenges, although not the only one. An example of a shared-memory program is a normal application that uses multithreading provided either directly by the OS or, more frequently, by the language runtime.

The opposite of shared-memory is a shared-nothing environment, in which the threads don't have any common data storage and can coordinate only by sending messages directly to other processes. The contents of the messages have to be copied from the memory of the sender to that of the receiver. In this setting, some of the synchronization problems disappear, but others still remain: at the fundamental level, some synchronization or coordination still needs to happen. From a performance standpoint, however, the shared-nothing mode is usually inferior due to the need for additional data copying. So, both paradigms have their place, and the choice of which one to utilize depends on the context of a particular task.

The main goal of synchronization is ensuring program correctness when multiple computations are running in parallel. The other side of the coin is achieving optimal performance, which is also addressed by parallelization, a topic we touched upon in a couple of prior chapters. Prioritizing performance over correctness, although tempting, is one of the primary sources of bugs in concurrent systems. A trivial example would be building a shared-memory program without explicit use of any synchronization mechanisms. It is, definitely, the most performant approach, but non-coordinated access to the shared data will inevitably result in failures like data corruption.
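
To illustrate the point, here is a toy experiment (my own sketch, not from the book; it assumes the BORDEAUX-THREADS library under its usual BT nickname): several threads increment a shared counter with no locking at all, and the lost updates show up as a final count below the expected value.

(defvar *counter* 0)

(defun racy-increments (&key (threads 4) (increments 100000))
  "Increment *COUNTER* from several threads with no synchronization."
  (setf *counter* 0)
  (let ((workers (loop :repeat threads
                       :collect (bt:make-thread
                                 (lambda ()
                                   (loop :repeat increments
                                         :do (incf *counter*)))))))
    (mapc 'bt:join-thread workers)
    ;; On a multi-core machine the result is typically well below
    ;; (* threads increments), because INCF is a non-atomic
    ;; read-modify-write sequence.
    *counter*))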

More details about the book may be found on its website.

2020-02-19

Programming Algorithms: Compression

This is a snippet of the "Compression" chapter of the book.

Compression is one of the tools that every programmer should understand and wield confidently. Situations when the size of the dataset is larger than the program can handle directly, so that it becomes a bottleneck, are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless compression, which preserves the original information intact, and lossy compression, which discards some information (assumed to be the most useless part or just noise). Lossless compression is applied to numeric or text data, whole files or directories: the kind of data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We touched on the subject of lossy compression slightly in the previous chapter when talking about such formats as JPEG. In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.

Encoding

Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.

Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. Their pros and cons are mirror images of each other: text-based formats are easier for humans to handle but are usually more expensive to process, while binary variants are not transparent (and so much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience is that they are a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort, when it has been proven that the impact on the program's performance will be significant.

Base64

Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme: Base64. It is a byte-level (lowest-level) encoding that doesn't discriminate between different input data representations and formats. The encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually, +, /, and =). This encoding is often used for transferring data on the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts, of which there are 64 different variants), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (=) is used to indicate 2 (=) or 4 (==) misaligned bits. As we can see, Base64 encoding increases the size of the input data by a factor of 4/3: every 3 input bytes become 4 output characters.

Here is one of the ways to implement a Base64 serialization routine:

(defparameter *b64-dict*
  (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\a) :to (char-code #\z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\0) :to (char-code #\9)
                        :collect (code-char ch))
                  '(#\+ #\/ #\=))
          'simple-vector))

(defun b64-encode (in out)
  (let ((key 0)
        (limit 6))
    (flet ((fill-key (byte off beg limit)
             (:= (ldb (byte limit off) key)
                 (ldb (byte limit beg) byte))
             (:= off (- 6 beg)))
           (emit1 (k)
             (write-byte (char-code (svref *b64-dict* k)) out)))
      (loop :for byte := (read-byte in nil) :while byte :do
        (let ((beg (- 8 limit)))
          (fill-key byte 0 beg limit)
          (emit1 key)
          (fill-key byte (:= limit (- 6 beg)) 0 beg)
          (when (= 6 beg)
            (emit1 key)
            (:= limit 6))))
      (when (< limit 6)
        (:= (ldb (byte limit 0) key)
            (ldb (byte limit 0) 0))
        (emit1 key)
        (loop :repeat (ceiling limit 2) :do
          (emit1 64))))))

This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit-manipulating ones on numbers. However, it would also be orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit manipulation functions, and there's nothing stopping the programmer from writing performant code operating at the single-bit level.

One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:

  • It is quite easy to wrap the code so that we could feed/extract strings as inputs and outputs (see the sketch after this list). Doing the opposite, wrapping string-based code for stream operation, is also possible, but it defeats the whole purpose of streams, which is...
  • Streams allow us to efficiently handle data of any size and not waste memory, as well as CPU, on storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: (with-open-file (in ...) (with-out-file (out) (b64-encode in out))). With strings, however, it would mean, first, reading the file contents into memory (and we may not even have enough memory for that) and, after that, filling another big chunk of memory with the encoded data, which we'll still, probably, need to either dump to a file or send over the network.
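
For illustration, such a string wrapper might look like this (a sketch that assumes the FLEXI-STREAMS library under its FLEX nickname; the name b64str matches the one used in the tests below, but this is not code from the book):

(defun b64str (string)
  "Encode STRING to Base64 by wrapping B64-ENCODE with in-memory streams."
  ;; note: assumes single-byte characters, like the REPL example below
  (flex:with-input-from-sequence (in (map 'vector 'char-code string))
    (let ((out (flex:make-in-memory-output-stream)))
      (b64-encode in out)
      (map 'string 'code-char (flex:get-output-stream-sequence out)))))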

So, what happens in the code above? First, the bytes are read from the binary input stream in, then each one is split into 2 parts. The higher bits are set into the current Base64 key, which is translated, using *b64-dict*, into an appropriate byte and emitted to the binary output stream out. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration we will have 2 Base64 bytes available, as the first will use 2 bits from the incoming byte and the second will consume its remaining 6 bits. This case is addressed in the code block (when (= 6 beg) ...). The function relies on the standard Lisp ldb operation, which provides access to the individual bits of an integer. It uses the byte-spec (byte limit offset) to control the bits it wants to obtain.
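
To get a feel for how byte-specs slice an integer, here is a quick REPL check (mine, not from the book) on the first byte of the Wikipedia example, the character M with code 77 (#b01001101):

CL-USER> (ldb (byte 6 2) 77)  ; the 6 high bits: #b010011
19                            ; index 19 in *b64-dict* is #\T
CL-USER> (ldb (byte 2 0) 77)  ; the 2 leftover low bits
1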

Implementing a decoder procedure is left as an exercise to the reader...

Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the FLEXI-STREAMS library to work with binary in-memory streams):

CL-USER> (with-input-from-string (str "Man i")
           (let ((in (flex:make-in-memory-input-stream 
                      (map 'vector 'char-code
                           (loop :for ch := (read-char str nil) :while ch :collect ch))))
                 (out (flex:make-in-memory-output-stream)))
             (b64-encode in out)
             (map 'string 'code-char (? out 'vector))))
"TWFuIGk="

This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (key) and the byte being processed. What really helps me tackle such situations is a piece of paper that serves for recording several iterations with all the relevant state changes. Something along these lines:

        M (77)    |    a (97)     |    n (110)
   0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0
0: 0 1 0 0 1 1    |               |                 19 = T
               0 1|               |
1:             0 1|0 1 1 0        |                 22 = W
                  |        0 0 0 1|
2:                |        0 0 0 1|0 1               5 = F

Iteration 0:

beg: 2
off: 0
limit: 6

beg: 0
off: (- 6 2) = 4
limit: 2


Iteration 1:

beg: 4
off: 0
limit: 4

beg: 0
off: (- 6 4) = 2
limit: 4

Another thing that is indispensable when coding such procedures is the availability of reference examples of the expected results, like the ones in the Wikipedia article. The Lisp REPL makes it very easy to iterate on a solution and constantly recheck the results using such available data. However, sometimes, it makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library SHOULD-TEST declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block (when (< limit 6) ...). Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. They address all 3 variants of padding and also the corner case of an empty string.

(deftest b64-encode ()
  ;; B64STR would be the function wrapped over the REPL code presented above
  (should be blankp (b64str ""))
  (should be string= "TWFu" (b64str "Man"))
  (should be string= "TWFuIA==" (b64str "Man "))
  (should be string= "TWFuIGk=" (b64str "Man i")))

Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.

More details about the book may be found on its website.

2020-02-10

prj-nlp v.3

Mariana Romanyshyn and I have started the third cohort of the course on processing human written language (or however one should best translate NLP :-p) at Projector. I wanted to write down a few details about it, because we like the course a lot and, naturally, every year we want its level to keep rising. And for that, potential students need to know about it and understand how it is organized.

The course is a three-month intensive that aims to prepare a specialist capable of solving NLP problems of any complexity independently and well, both alone and in a team. To get the most out of it, you need a certain base. As a rule, programmers, unsurprisingly, show the best results. But there are exceptions: we gladly take linguists, journalists, and, in general, specialists from other fields, provided that they have sufficient programming skills to write programs on their own, set up a development environment they are comfortable with, and understand basic algorithmic concepts. You don't need to know ML for the course, although it's desirable to have some idea of it. The course is built so that we alternate between NLP topics and the ML areas related to them. Of course, this doesn't mean that a person will end up well-versed in machine learning, but they will get the base needed to keep going deeper in this area.

The second prerequisite for success in the course is having enough time. We say that the bare minimum is 10 hours of independent work per week, plus 5 hours in class. In other words, taking the commute into account, that's already half a full-time job. Of course, some people may need even more independent time. Besides, the brain will be quite heavily loaded with the new topics, so for the duration of the course you'll have to give up other side projects, hobbies, and the like. It also doesn't work out well if you drop out for more than a week in a row for some external reason: illness, a business trip, a wedding, the birth of a child... :)

How is the course structured? We gather a group of 15 students and meet twice a week: one session is theoretical, on Thursday evening, dedicated to working through a particular topic; the other is practical, on Saturday, where we show an example of solving a problem on that topic and program it together. In most cases, this program becomes the basis for solving a more advanced but similar problem given as homework. Accordingly, we have 12 weeks of core work, that is, 12 topics and about 10 full-fledged homework projects on the level of building a sentiment analysis, fact-checking, or syntactic parsing system. Of course, given the limited time, each project is done within a certain restricted domain.

The course is split into 3 parts:

  • the first month is very thorough preparatory work: the basics of structural linguistics, working with data, metrics, proper experiments, the rule-based approach. By the end of the month, the student should have a well-structured picture of how to approach NLP problems correctly. The other outcome of this part is a formulated task for the course project and the start of work on it: the first data collected, a metric defined, a plan of experiments drawn up
  • the second month is an immersion in classic NLP with a parallel study of the most widespread ML techniques used in it. By the end of the month, after solving almost a dozen NLP problems in lectures, practice sessions, and at home, students should have developed the skills to apply these techniques independently in real projects. And the main part of the course project is done
  • the last month is deep learning in NLP. We warn right away that this course doesn't aim to cover as much of the hottest and most groundbreaking material as possible: there are plenty of other venues for that. We want to build a systematic understanding of NLP, with all of its 70-year history, because that history holds a great many useful and, perhaps, timeless things. So we only get to the state of the art toward the end (and in the last session we have a guest lecturer who talks about something from the bleeding edge :) Still, we do work through the fundamentals of DL, both in class and within the course project. Those students who are interested in this particular area end up training a dozen neural networks in their sandboxes by the end of the course and also try out the capabilities of deep learning in their course projects. Here, though, we can't boast stunning quality results, since the couple of weeks that any given task gets at most is not enough to achieve them: training deep models requires a lot of both computational resources and time. But we do learn how to approach it

As you can see from this description, we pay a great deal of attention to the course project, and we push work on it forward every week. As a result, most students end up with quite decent and interesting things, and over 70% of them reach the finish line with a complete, quality project (an unheard-of share for other courses I have taken part in). Some of the projects even make it out into the big world: some people build things related to their work, others to their hobbies. Over 2 years, we've had 2 data journalism studies, a project on analyzing interactions of drugs with each other and with a person's chronic diseases (based on processing package inserts), and a natural-language-query search system for a social conference app. There has also been a number of interesting personal projects that achieved great quality and were made with soul. The students present all of this after the course at a big final party at the Grammarly office.

One of our main goals for this course is to grow the Ukrainian NLP community, since we have essentially never had a school of computational linguistics. We hope to contribute to its formation together with other progressive educational projects in this area, in particular the master's program in Data Science at UCU. The course already has more than 30 graduates who have joined the closed prj-nlp-alumni club, where we share interesting things and opportunities and also plan to meet periodically in an informal setting, not only at events. So we hope to expand this club by another half in June of this year :)

P.S. Speaking of UCU. I also take part in the NLP course of their program as a lecturer and a mentor of theses. It's a somewhat different experience from this course. Naturally, UCU offers a more academic program that runs for a longer time. The students there get a good and, importantly, systematic grounding in ML and DL, so the NLP course doesn't have to devote any attention to that at all. On the other hand, that course is shorter and is taught by several lecturers, so it's harder to form a coherent picture within it, and there's no way to organize the same level of immersion and concentration as in the Projector course. Then again, they get more time for the master's thesis than our whole course takes. Most importantly, though, both places attract well-prepared and motivated students, so the results at UCU also turn out to be of good quality, with a large number of interesting works, some of which are at the level of papers for the top scientific conferences in the field. Personally, I still prefer the Projector format, since it lets you feel the spirit of intensive teamwork for three months and then breathe a sigh of relief at the end, anticipating a nine-month break and a new iteration...

2020-01-11

Programming Algorithms: Approximation

This is a snippet of the "Approximation" chapter of the book.

This chapter will be a collection of stuff from somewhat related but still distinct domains. What unites them is that all the algorithms we will discuss are, after all, targeted at calculating approximations to some mathematical functions. There are no advanced data structures involved, nor is the aim to find a clever way to improve the runtime of some common operations. No, these algorithms are about calculations and computing an acceptable result within the allocated time budget.

Combinatorial Optimization

Dynamic Programming is a framework that can be used for finding the optimal value of some loss function when there are multiple configurations of the problem space that result in different values. Such search is an example of discrete optimization for there is a countable number of states of the system and a distinct value of the cost function we're optimizing corresponding to each state. There are also similar problems that have an unlimited and uncountable number of states, but there is still a way to find a global or local optimum of the cost function for them. They comprise the continuous optimization domain. Why is optimization not just a specialized area relevant to a few practitioners but a toolbox that every senior programmer should know how to utilize? The primary reason is that it is applicable in almost any domain: the problem just needs to be large enough to rule out simple brute force. You can optimize how the data is stored or how the packets are routed, how the blueprint is laid out or the servers are loaded. Many people are just not used to looking at their problems this way. Also, understanding optimization is an important prerequisite for having a good grasp of machine learning, which is revolutionizing the programming world.

DP is an efficient and, overall, great optimization approach, but it can't succeed if the problem doesn't have an optimal substructure. Combinatorial Optimization approaches deal with finding a near-optimum for the problems where an exhaustive search requires O(2^n) computations. Such problems are called NP-hard, and a classic example of those is the Travelling Salesman Problem (TSP). The task is to find an optimal order of edges in a cycle spanning all vertices of a fully-connected weighted graph. As we saw previously, this problem doesn't have an optimal substructure, i.e. an optimal partial solution isn't necessarily a part of the best overall one, and so taking the shortest edge doesn't allow the search procedure to narrow down the search space when looking at the next vertex. A direct naive approach to TSP will enumerate all the possible variants and select the one with a minimal cost. However, the number of variants is n!, so this approach becomes intractable very fast. A toy example of visiting all the capitals of the 50 US states has 10^64 variants. This is where quantum computers promise to overturn the situation, but while we're waiting for them to mature, the only feasible approach is developing approximation methods that will get us a good enough solution in polynomial (ideally, linear) time. TSP may look like a purely theoretical problem, but it has some real-world applications. Besides vehicle routing, automated drilling and soldering in electronics is another example. Yet, even more important is that there are many other combinatorial optimization problems, and, in essence, the approaches to solving one of them apply to all the rest. I.e., like with the shortest path, coming up with an efficient solution to TSP allows us to efficiently solve a very broad range of problems over a variety of domains.

So, let's write down the code for the basic TSP solution. As usual, we have to select the appropriate graph representation. From one point of view, we're dealing with a fully-connected graph, so every representation will work and a matrix one will be the most convenient. However, storing an n^2-sized array is not the best option, especially for a large n. A better "distributed" representation might be useful here. Yet, for the TSP graph, an even better approach would be to do the opposite of our usual optimization trick: trade computation for storage space. When the graph is fully-connected, usually, there exists some kind of an underlying metric space that contains all the vertices. The common example is the Euclidean space, in which each vertex has a coordinate (for example, the latitude and longitude). Anyway, whichever way to represent the vertex position is used, the critical requirement is the existence of the metric that may be calculated at any time (and fast). Under such conditions, we don't have to store the edges at all. So, our graph will be just a list of vertices.

Let's use the example with the US state capitals. Each vertex will be represented as a pair of floats (lat & lon). We can retrieve the raw data from the Wikipedia article about the US capitols (with an 'o') and extract the values we need with the following code snippet[1], which cuts a few corners:

(defstruct city
  name lat lon)

(defparameter *wp-link* "https://en.wikipedia.org/w/index.php?title=List_of_state_and_territorial_capitols_in_the_United_States&action=edit&section=1")

(defparameter *cs*
  (with ((raw (drakma:http-request *wp-link*))
         (coords-regex (ppcre:create-scanner "\\{\\{coord\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|type"))
         (capitals (list)))
    (flet ((dms->rad (vec off)
             (* (/ pi 180)
                (+     (? vec (+ off 0))
                    (/ (? vec (+ off 1)) 60)
                    (/ (? vec (+ off 2)) 3600)))))
      (dolist (line (split #\Newline (slice raw
                                            (search "{| class=\"wikitable sortable\"" raw)
                                            (search "</textarea><div class='editOptions'>" raw))))
        (when-it (and (starts-with "|" line)
                      (search "{{coord" line))
          (with ((_ coords (ppcre:scan-to-strings coords-regex line))
                 (coords (map* 'read-from-string coords)))
            (push (make-city :name (slice line (position-if 'alpha-char-p line)
                                          (position-if (lambda (ch) (member ch '(#\] #\|)))
                                                       line :start 1))
                           :lat (dms->rad coords 0)
                           :lon (dms->rad coords 3))
                capitals)))))
    (coerce capitals 'vector)))

CL-USER> (length *cs*)
50

We also need to define the metric. The calculation of distances on Earth, though, is not as straightforward as on a plane. Usually, as a first approximation, the haversine formula is used: it provides an estimate of the shortest "as-the-crow-flies" distance over the surface (ignoring the relief).

(defun earth-dist (c1 c2)
  (with ((lat1 (? c1 'lat))
         (lat2 (? c2 'lat))
         (a (+ (expt (sin (/ (- lat2 lat1) 2))
                     2)
               (* (cos lat1)
                  (cos lat2)
                  (expt (sin (/ (- (? c2 'lon) (? c1 'lon)) 2)) 
                        2)))))
    (* 1.2742e7  ; Earth diameter
       (atan (sqrt a) (sqrt (- 1 a)))))) 

With the metric at our disposal, let's define the function that will calculate the length of the whole path and use it for a number of random paths (we'll use the RUTILS function shuffle to produce a random path).

(defun path-length (path)
  (let ((rez (earth-dist (? path 0) (? path -1))))
    (dotimes (i (1- (length path)))
      (:+ rez (earth-dist (? path i) (? path (1+ i)))))
    rez))

CL-USER> (path-length *cs*)
9.451802301259182d7
CL-USER> (path-length (shuffle *cs*))
9.964776273250546d7
CL-USER> (path-length (shuffle *cs*))
1.009761841183094d8

We can see that an average path may have a length of around 100k kilometers. However, we don't know anything about the shortest or the longest one, and to find out reliably, we'd have to evaluate 50! paths... Yet, even as we accept the sad fact that this is not possible with our current technology, it's not time to give up yet. We may not be able to find the absolute best path, but at least we can try to improve on a random one. Already, the three previous calculations had a spread of around 5%. So, if we're lucky, maybe we could hit a better path purely by chance. Let's try a thousand paths using our usual argmin pattern:

(defun random-search (path n)
  (let ((min (path-length path))
        (arg path))
    (loop :repeat n :do
      (with ((path (shuffle path))
             (len (path-length path)))
        (when (< len min)
          (:= min len
              arg path))))
    (values arg
            min)))

CL-USER> (:= *print-length* 2)
2
CL-USER> (random-search *cs* 1000)
(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)
 #S(CITY :NAME "Montpelier, Vermont" :LAT 0.772521512027179d0 ...) ...)
7.756170773802838d7

OK, we've got a sizable 20% improvement. What about 1,000,000 combinations?

CL-USER> (time (random-search *cs* 1000000))
Evaluation took:
  31.338 seconds of real time
...
(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)
 #S(CITY :NAME "Helena, Montana" :LAT 0.813073800024579d0 ...) ...)
6.746660953705506d7

Cool, another 15%. Should we continue increasing the size of the sample? Maybe, after a day of computations, we could get the path length down by another 20-30%. And that's already a good gain. Surely, we could also parallelize the algorithm or use a supercomputer in order to analyze many more variants. But there should be something smarter than simple brute force, right?

More details about the book may be found on its website.

2019-12-13

Programming Algorithms: Dynamic Programming

This is a snippet of the "Dynamic Programming" chapter of the book.

This chapter opens the final part of the book entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters in analyzing a selection of important problems that are mostly application-independent and find usages in many applied domains: optimization, synchronization, compression, and similar.

We will start with a single approach that is arguably the most powerful algorithmic technique in use. If we manage to reduce a problem to Dynamic Programming (DP), in most cases, we can consider it solved. The fact that we have progressed so far in this book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I wanted to contain this topic in its own chapter, so I deliberately didn't start the exposition earlier. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technique finds application in almost every area.

Also, DP is one of the first marketing terms in CS. When Bellman invented it, he wanted to use the then-hyped term "programming" to promote his idea. This has, probably, caused more confusion over the years than benefit. In fact, a good although unsexy name for this technique could be simply "filling the table", as the essence of the approach is an exhaustive evaluation of all variants with memoization of partial results (in a table) to avoid repeating redundant computations. Obviously, it will bring benefits only when there are redundant computations, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP, we need to validate that it has the optimal substructure property:

A problem has optimal substructure if an optimal solution to the whole problem incorporates optimal solutions to its subproblems.

An example of optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C and there are multiple paths from C to B, the one included in the shortest path A-B must be the shortest of them. In fact, the shortest path is an archetypal DP problem which we'll discuss later in this chapter. A counterexample is the Travelling Salesman Problem (TSP): if it had optimal substructure, the subpath between any two nodes in the resulting path would have to be the shortest possible path between these nodes. But this isn't true for all nodes, because it can't be guaranteed that those shortest subpaths will combine into a cycle with all the other shortest paths.

Fibonacci Numbers

So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.

A naive approach to calculating the i-th number will be directly coding the Fibonacci formula:

(defun naive-fib (i)
  (assert (typep i '(integer 0)))
  (if (< i 2) 1
      (+ (naive-fib (- i 1))
         (naive-fib (- i 2)))))

However, applying it will result in an exponential growth of the number of computations: each call to naive-fib results in two more calls. So, the number of calls needed for the n-th number, with this approach, is O(2^n).

> (time (naive-fib 40))
Evaluation took: 3.390 seconds of real time
165580141
> (time (naive-fib 42))
Evaluation took: 7.827 seconds of real time
433494437

Yet, we can see here a direct manifestation of the optimal substructure property: the i-th number calculation uses the result of the (1- i)-th one. To utilize this recurrence, we'll need to store the previous results and reuse them. It may be achieved by changing the function call into a table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.

(let ((fib (vec 1 1)))  ; our table will be an adjustable vector
  (defun fib (i)
    (when (< (length fib) i)
      (vector-push-extend (fib (- i 1)) fib))
    (+ (? fib (- i 1))
       (? fib (- i 2)))))

What we've done here is added a layer of memoization to our function that uses an array fib that is filled with the consecutive Fibonacci numbers. The array is hidden inside the closure of the fib procedure, so it will persist between the calls to it and accumulate the numbers as they are requested. There will also be no way to clear it, apart from redefining the function, as the closed over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent starting from the lowest yet unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded O(2^n) running time to a mere O(n)!

Such a calculation is the simplest example of top-down DP that is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may turn significant, in some cases: extra space consumption by each recursive call. It's not only O(n) in time, but also in space. The alternative strategy that gets rid of redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial, in this case:

(let ((fib (vec 1 1)))
  (defun bottom-up-fib (i)
    (let ((off (length fib)))
      (adjust-array fib (1+ i) :fill-pointer t)
      (dotimes (j (- (1+ i) off))
        (let ((j (+ j off)))
          (:= (aref fib j)
              (+ (aref fib (- j 1))
                 (aref fib (- j 2)))))))
    (aref fib i)))
> (time (bottom-up-fib 42))
Evaluation took: 0.000 seconds of real time
> (time (bottom-up-fib 4200))
Evaluation took: 0.004 seconds of real time
40512746637826407679504078155833145442086707013857032517543...  ; this number is a Lisp bignum, which has unlimited size

Funny enough, a real-world-ready implementation of Fibonacci numbers ends up not using recursion at all...
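
For instance, a constant-space version (a quick sketch of mine, not from the book) only needs to keep the last two numbers around:

(defun iter-fib (i)
  "Return the I-th Fibonacci number using the same 1 1 2 3 ... indexing as above."
  (let ((prev 1)
        (cur 1))
    (loop :repeat (1- i) :do
      (psetf prev cur
             cur (+ prev cur)))
    cur))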

More details about the book may be found on its website.

2019-11-20

Programming Algorithms: Strings

This is a snippet of the "Srings" chapter of the book.

It may not be immediately obvious why a whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to this challenge:

  • indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays
  • there's a lot of string-specific algorithms that deserve their own chapter
  • finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework

In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to reiterate, strings should store their length. Netstrings are a notable take on the idea of length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netstring for the string foo.
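
As a tiny illustration, writing and reading this format takes just a few lines (a sketch with made-up helper names, not from the book):

(defun write-netstring (string stream)
  "Output STRING to STREAM in the <length>:<contents>, format."
  (format stream "~A:~A," (length string) string))

(defun read-netstring (stream)
  "Read one netstring from STREAM and return its contents."
  (let ((len (loop :for ch := (read-char stream)
                   :until (char= ch #\:)
                   :collect ch :into digits
                   :finally (return (parse-integer (coerce digits 'string))))))
    (prog1 (let ((buf (make-string len)))
             (read-sequence buf stream)
             buf)
      (read-char stream))))  ; consume the trailing comma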

More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed- or variable-length byte sequences. The latter kind of character encoding raises a challenging question: what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve a character by index. Different language ecosystems address this issue differently, and the majority are, unfortunately, broken in one aspect or another. Overall, there may be two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such a representation will be 32-bit UTF-32, resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach is to utilize a more advanced data structure. The naive variant is a list, which implies an unacceptable slowdown of the character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions is to utilize the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12% overhead). Calculating a character's position may be performed in a small number of steps with the help of an operation that is infamous in close circles: population count, aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via the logcount Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke count 1 on them. At least this is the case for SBCL:

CL-USER> (disassemble (lambda (x) 
                        (declare (type (simple-array bit) x))
                        (count 1 x)))

; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2:       F3480FB8FA       POPCNT RDI, RDX

The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current one, which will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump will have skipped fewer than n characters. If the difference is sufficiently small (say, below 10), we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2, we can jump ahead n characters again (this will repeat at most 3 times, as the maximum byte-length of a character is 4), and if it's below n/2, we can jump n/2 characters. And if we overshoot, we can reverse the direction of the next jump or search. You can see where this is heading: if at each step (or, at least, at each 4th step) we are constantly halving our numbers, this means O(log n) complexity. That's the worst performance we can get for this function, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte (just 2 operations) or 4 bytes (8 ops).

Here is the prototype of the char-index operation implemented according to the described algorithm (without the implementation of the mb-linear-char-index that performs the final linear scan):

(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
      (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i)))
             (diff (- i cnt)))
        (cond
         ((= cnt i) (return (+ off i)))
         ((< diff *mb-threshold*) (return (mb-linear-char-index
                                           string diff off)))
         ((< cnt (floor i 2)) (:+ off i)
                              (:- i cnt))
         (t (:+ off (floor i 2))
            (:- i cnt)))))))

The length of such a string may be calculated by performing the popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))

It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.

More details about the book may be found on its website.

2019-10-25

Programmatically Drawing Simple Graphviz Graphs

For my book, I had to draw a number of graphs. Obviously, I wanted a programmatic way to do that, and Graphviz is the go-to library for it. In Lisp, the interface to Graphviz is cl-dot, but, for me, it wasn't easy to figure out from the manual the "simple and easy" way to use it. I.e. I couldn't find a stupid beginner-level interface, so I had to code it myself. Here's the implementation that allows anyone with a REPL to send lists of edges to Graphviz and obtain graph images.
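
The original code isn't reproduced here, but a minimal wrapper of the kind I mean could look roughly like this (a sketch with assumed names such as draw-graph and *edges*; it relies on cl-dot's documented generic functions and is not the exact code behind the images):

(defvar *edges* nil
  "The list of (FROM . TO) pairs currently being drawn.")

(defmethod cl-dot:graph-object-node ((graph (eql 'edge-list)) object)
  ;; every vertex becomes a node labeled with its printed representation
  (make-instance 'cl-dot:node
                 :attributes (list :label (princ-to-string object))))

(defmethod cl-dot:graph-object-points-to ((graph (eql 'edge-list)) object)
  ;; the neighbors of OBJECT are the TO parts of the edges starting at it
  (loop :for (from . to) :in *edges*
        :when (eql from object) :collect to))

(defun draw-graph (edges file &key (format :png))
  "Render the list of (FROM . TO) EDGES into FILE via Graphviz."
  (let* ((*edges* edges)
         (roots (remove-duplicates (mapcar 'car edges)))
         (graph (cl-dot:generate-graph-from-roots 'edge-list roots)))
    (cl-dot:dot-graph graph file :format format)))

A call like (draw-graph '((a . b) (b . c) (a . c)) "graph.png") would then produce an image, provided the dot binary is installed.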

Generated images: