Recently at work I needed to search through our archived files and provide the results by the end of the day. Here are the parameters of the request:
- The archive files are encrypted and stored in HDFS (Don’t ask why we store them in HDFS).
- The files vary in size from 3–9 GB.
- The total number of files to search was 300+.
- It takes 1–2 minutes to decrypt each file.
In the past there have been requests to search a single archived file. In those cases we would copy the file out of HDFS to a server, then run a shell script to decrypt the file and perform the search. The decrypting program requires two arguments: an encrypted input file and a file to write the decrypted data to. This means the encrypted and decrypted files are on disk at the same time.
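For context, that one-off workflow looked roughly like this (a sketch only; `hadoop fs` is real, but `decrypt_prog`, the HDFS path, and the search term are stand-ins for the actual names):

```bash
# Copy the encrypted archive out of HDFS onto local disk.
hadoop fs -get /archives/file1.enc /tmp/file1.enc

# decrypt_prog is a placeholder for our decryptor: it takes an input file
# and an output file, so both copies sit on disk at the same time.
decrypt_prog /tmp/file1.enc /tmp/file1.dat

# Run the search against the decrypted copy, then clean up.
grep 'search-term' /tmp/file1.dat >> results.txt
rm /tmp/file1.enc /tmp/file1.dat
```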
At an average rate of 1.5 minutes to decrypt a single file, it was going to take 450 minutes (7.5 hours) for 300 files. To add to my dilemma, there wasn't enough time to write a custom RecordReader. The only solution was to stream the files in parallel. But there were two problems with that approach:
- The server does not have enough space for 20 (10 encrypted and 10 decrypted) files at a time.
- The decrypting code does not read from stdin or write to stdout.
What to do? Use named pipes of course!
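The trick, roughly (again a sketch with placeholder names), is that a FIFO looks like an ordinary file to the decryptor, so HDFS, the decryptor, and the search can be wired together without anything landing on local disk:

```bash
# Create two named pipes; they occupy no disk space.
mkfifo /tmp/file1.enc.pipe /tmp/file1.dat.pipe

# Stream the encrypted bytes straight out of HDFS into the first pipe.
hadoop fs -cat /archives/file1.enc > /tmp/file1.enc.pipe &

# decrypt_prog (placeholder name) still sees two "files" as its arguments,
# but it is really reading from one pipe and writing to the other.
decrypt_prog /tmp/file1.enc.pipe /tmp/file1.dat.pipe &

# Search the decrypted stream as it is produced.
grep 'search-term' /tmp/file1.dat.pipe >> results.txt

rm /tmp/file1.enc.pipe /tmp/file1.dat.pipe
```

Run several of these pipelines at once and the disk-space problem goes away, since neither the encrypted nor the decrypted copy is ever materialized locally.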