Fiddling with files & bots
Yesterday the September edition of the Amsterdam.rb Ruby meetup took place at WeWork, with insightful talks by WeTransfer's Julik Tarkhanov and GitHub Developer Advocate Don Goodman Wilson.
It's a TIFF. Adventures in file format
Julik shared a fairly new feature with us: image previews on WeTransfer. It is enabled by default in most countries, as WeTransfer is testing the load on its servers. The number of image transfers over the platform runs in the millions per day; Julik is not at liberty to share the exact number.
Julik Tarkhanov
At a certain point 'just download' was not enough, says Julik. WeTransfer wants people to be able to present their work and to offer them choices: what to download if you're pressed for time? What we need to know about a file, without downloading it in full, is the file format, the resolution for images, compositions for archives, duration for video/audio and page count for PDFs.
Before loading an image into your image processing library you want to know how large it's going to be. Assuming 8-bit color channels and RGBA, an uncompressed buffer for a 512x512 image is 1048576 bytes (width_px * height_px * 4). Indeed, file processing can get ugly. It's CPU-intensive, IO-intensive and relies on a multitude of libraries, and libraries always come with a security risk. We can't download entire files and run detection on them.
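The buffer arithmetic above can be written out as a small Ruby sketch. The method name and the default of 4 bytes per pixel (8-bit RGBA) are assumptions made for illustration; they do not come from WeTransfer's code.

```ruby
# Hypothetical helper: size in bytes of an uncompressed pixel buffer.
# Assumes 8 bits per channel; RGBA means 4 channels, so 4 bytes per pixel.
def uncompressed_buffer_bytes(width_px, height_px, bytes_per_pixel = 4)
  width_px * height_px * bytes_per_pixel
end

uncompressed_buffer_bytes(512, 512)       # => 1048576
uncompressed_buffer_bytes(10_000, 10_000) # => 400000000
```

The second call shows why you want the dimensions first: a 10000x10000 image needs roughly 400 MB once decoded, however small the compressed file is.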
Detecting files reliably is hard, but not impossible. WeTransfer considered using FastImage, Dimensions, imagesize or exiftool for file properties detection, before writing their own tool.
Their tool, format_parser, had the following requirements:
- Random access via HTTP using Range: requests (you don't want to inspect a file bit by bit, scanning straight ahead with a ton of elsif statements, only to find out the format information is at the very end), plus support for local files,
- One isolated module per supported file format, instead of a page-long case statement,
- Some protection from crafted malicious payloads.
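The requirements above can be illustrated with a minimal sketch. It is not format_parser's actual API: the module names, the `detect` method and the two Range helpers are hypothetical. Each parser is an isolated module that reads a small, fixed number of bytes from any IO-like object, which also bounds how much a crafted file can make it read.

```ruby
require "stringio"

# Hypothetical helpers: values for an HTTP Range request header,
# for the first or the last `length` bytes of a remote file.
def head_range(length)
  "bytes=0-#{length - 1}"
end

def tail_range(length)
  "bytes=-#{length}"
end

# One isolated module per format; each one only looks at a fixed-size header.
module PNGSketch
  MAGIC = "\x89PNG\r\n\x1A\n".b

  def self.call(io)
    io.seek(0)
    header = io.read(24)
    return unless header && header.bytesize == 24 && header.start_with?(MAGIC)
    # Width and height are big-endian 32-bit integers in the IHDR chunk.
    width_px, height_px = header.byteslice(16, 8).unpack("N2")
    { format: :png, width_px: width_px, height_px: height_px }
  end
end

module GIFSketch
  MAGICS = ["GIF87a".b, "GIF89a".b].freeze

  def self.call(io)
    io.seek(0)
    header = io.read(10)
    return unless header && header.bytesize == 10 && MAGICS.include?(header.byteslice(0, 6))
    # Width and height are little-endian 16-bit integers after the signature.
    width_px, height_px = header.byteslice(6, 4).unpack("v2")
    { format: :gif, width_px: width_px, height_px: height_px }
  end
end

PARSERS = [PNGSketch, GIFSketch].freeze

# Try each parser in turn; return the first result, or nil if none matches.
def detect(io)
  PARSERS.each do |parser|
    result = parser.call(io)
    return result if result
  end
  nil
end

# Usage with an in-memory PNG header (signature, IHDR length, type, dimensions).
png = PNGSketch::MAGIC + [13].pack("N") + "IHDR" + [512, 512].pack("N2")
detect(StringIO.new(png)) # a Hash with format :png and 512x512 dimensions
```

Adding a format means adding one module to the list, rather than growing a single case statement. Against a remote file, the same parsers could read from an object that fetches only the requested bytes using headers such as `head_range(24)`.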
Julik claims that format_parser is on par with FastImage in terms of speed, and orders of magnitude faster when the file information sits at the tail of the file. WeTransfer even goes as far as having job candidates write a module for format_parser as a kind of litmus test.
Let's build a GitHub App
Don is more used to writing C++ code, but he wrote a Sinatra app for the occasion. Don enjoys 'automating all the things' and to that end created a GitHub App that is essentially an issue triage bot, using NLP (Recast.ai) and Octokit.rb.
Don Goodman Wilson
GitHub Apps allow you to do anything you can do as a user, but as an app. You can use GitHub Apps to build custom workflows (bespoke or open source), 3rd party integrations, and native apps.
A little while ago I attended a workshop at GitHub's Amsterdam office to learn how to automatically add labels to issues, using Probot and NodeJS. The Ruby app Don showcases works roughly the same, in that it receives the issues event, checks whether action is opened, analyzes the issue title, and assigns a label using the Issues API.
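The flow described above can be sketched in plain Ruby, without Sinatra or a network connection. This is not Don's code: the `triage` method and the keyword rules are hypothetical, and the keyword matching is a crude stand-in for the NLP step. The payload fields (`action`, `issue.title`, `issue.number`, `repository.full_name`) follow GitHub's issues webhook event.

```ruby
require "json"

# Hypothetical keyword rules standing in for the NLP service.
LABEL_RULES = {
  "bug" => /\b(crash|crashes|error|broken|fails?)\b/i,
  "question" => /\bhow (do|can|to)\b/i,
  "enhancement" => /\b(add|support|feature)\b/i
}.freeze

# Pick the first label whose pattern matches the issue title.
def label_for_title(title)
  LABEL_RULES.each { |label, pattern| return label if title.match?(pattern) }
  nil
end

# event_name is the value of the X-GitHub-Event request header,
# raw_body is the JSON body of the webhook request.
# Returns [repo, issue_number, label] when a label applies, otherwise nil.
def triage(event_name, raw_body)
  return unless event_name == "issues"

  payload = JSON.parse(raw_body)
  return unless payload["action"] == "opened"

  label = label_for_title(payload.dig("issue", "title").to_s)
  return unless label

  [payload.dig("repository", "full_name"), payload.dig("issue", "number"), label]
end

# In a real app the result could be handed to Octokit.rb, for example:
#   client.add_labels_to_an_issue(repo, number, [label])
```

Keeping the decision logic free of HTTP and API calls makes it easy to test with a canned payload; the web framework only has to pass in the event header and the request body.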
The code for the bot is up on GitHub. Try and install it on a repo, and create a new issue to see the bot working. Or just create an issue on Don's sandbox env. The training material for the bot is 'remixable' as well.
The October 16 Amsterdam.rb meetup will host 'Hubber' and Rails core member Eileen (@eileencodes) and Fotos Georgiadis, Senior Software Engineer at Catawiki.