About

I’m a professional and hobby software developer. I like whale meat, horse meat, rabbit meat, alcoholic drinks, saunas and the Scandinavian type of socialism, and if none of that offends you we’ll probably get along.

PGP key fingerprint: E5D2 3A00 D8C6 93B4 C2AC D93C EC92 D395 260D 3194

All content copyleft Victor Engmark 2004-$(date +%Y).

11 thoughts on “About”

    • tl;dr No.

      • It would be a lot of work to save 13 characters (Space + xspf2csv.xsl)
      • Because of the guesstimated one-to-two order of magnitude increase in complexity, the resulting project would be much harder to maintain than the current solution. This would effectively kill any hope of getting anyone to improve the project.
      • It works. Right now.
  1. Hi,

    This is a strange place to post this, so feel free to delete my comment afterward, but I couldn’t find any other place to contact you. I saw your post on unix.stackexchange saying you had good luck with the Bolse 300 Mbps USB adapter in Ubuntu, and I just wanted to confirm that you were using a recent stock kernel and got good speeds out of the box. Thanks, and sorry for the odd message!

    • I can confirm that I am getting good speeds, but bad connectivity in both Windows and Linux. I’ve updated the original post with further information. If you have any further questions, it would be great if you could follow up at the link – Stack Overflow really is a most excellent Q&A site, and your questions are very valid.

      • Thanks! I do love Stack Overflow, I just wasn’t sure if you’d still be receiving notifications on that old thread. Much appreciated.

  2. Thanks for the tip about the shebang – I’ve added it to my post. However, these comments really belong on unix.stackexchange.com, and not here. I will remove them shortly.

    • I wasn’t sure there was much I could add, since it’s a general answer to a specific question, and I’m not familiar with the specifics of some of the tech involved, the use case, or the budget situation. If the current answer is not useful, just let me know; I can delete it so others don’t think the question has already been answered.

      • Hi,

        I’m sorry, I didn’t receive any notifications (I probably forgot to tick the checkbox) that you had answered me, either here or on Stack Overflow. Only now did I see that you had edited your answer.

        Well, I can’t say it’s not useful, but I wanted to elaborate on some topics you mentioned in your answer.

        I think you may be able to clarify some things even without being familiar with the specifics.

        Firstly, I understand that S3 costs will rise if Firehose writes the data initially and I then aggregate it somehow and write a new data file to S3. I haven’t calculated it precisely, but with EMR’s high prices it seems that keeping an EMR Spark cluster alive would cost more. And even then I would still have to write the data to S3 at some point.

        My biggest concern with your answer was the suggestion to stream the data to Spark immediately, with the smallest possible files. When I had problems reading a few GB worth of those small files, I found out that Spark and Hadoop cannot really handle lots of small files; that’s why I wanted to aggregate them. I can understand that this complicates shuffling if ingestion fails, but I cannot see how “sending data to Spark immediately and archiving S3 objects immediately” helps me. What did you mean by archiving S3 objects immediately?

        I can try to explain our current situation because it seems to me we might not understand each other exactly.

        We have a GoLang application which handles incoming requests and saves them to an Aerospike database. As we keep only a few months’ worth of data in Aerospike, we also send the data to Firehose, which then writes it to S3. So this is essentially our big data storage. Now we have a front-end application from which a user has to be able to see statistics based on this data. Currently they can make requests and see results based on the data from Aerospike, but I am implementing a solution where the same request would initiate a Spark pipeline instead of hitting Aerospike. This means that for an incoming request, a new EMR cluster is started, and a Scala script is uploaded from S3 to the cluster and started there with the request’s parameters (date period and filter/group parameters). Then the Spark Scala script should read in the data. I cannot read in ALL the data we have in S3; it would take too much time. That’s why I wanted daily files: I can read in files by the given date period, and that would work far faster if I had 1 file per day, not 96 or even 1440 (if Firehose wrote a new file every minute). Then the script would work on that data and write the results output file to S3.
        If these requests come often, the EMR cluster would stay live and the results would come faster; the user would just have to wait for the script to run. If not, the cluster would terminate itself after some time, to avoid paying for an idle cluster.
        But of course the biggest problem with this approach is the time it takes for the cluster to start, which is roughly 10 minutes.
        So the initial archiving/aggregation of data is a separate problem from handling and running the statistics requests.
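
        A minimal sketch of what that Spark Scala step might look like, assuming daily files under date-named prefixes; the bucket names, path layout, column names and argument handling below are hypothetical, not taken from the actual setup:

            import java.time.LocalDate
            import org.apache.spark.sql.SparkSession
            import org.apache.spark.sql.functions._

            object DailyStatsJob {
              def main(args: Array[String]): Unit = {
                // Hypothetical arguments: date period and a grouping column,
                // e.g. "2017-03-01 2017-03-31 country"
                val Array(fromStr, toStr, groupCol) = args
                val spark = SparkSession.builder().appName("daily-stats").getOrCreate()

                // Build one input path per day so only the files inside the
                // requested period are listed and read.
                val from = LocalDate.parse(fromStr)
                val to = LocalDate.parse(toStr)
                val days = Iterator.iterate(from)(_.plusDays(1)).takeWhile(!_.isAfter(to)).toSeq
                val paths = days.map(day => s"s3://my-data-bucket/daily/$day/*") // assumed layout

                // Apply the group parameter and write the results file back to S3.
                spark.read.json(paths: _*)
                  .groupBy(col(groupCol))
                  .count()
                  .write
                  .mode("overwrite")
                  .json(s"s3://my-data-bucket/results/$fromStr-$toStr-$groupCol/") // assumed output

                spark.stop()
              }
            }

        Enumerating one path per day is what makes the one-file-per-day layout pay off: Spark never has to list or open objects outside the requested period.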

        As for the general piped solution you provided in your answer with the latest edit, we basically have the same thing, or at least the same idea. I explained in my question initially that I wanted to use AWS Lambda to process input the moment an S3 put/post event is triggered, and by “process” I mean aggregate the new small file into an existing larger data file. This failed because I couldn’t figure out how to append the contents without reading both files into memory (which made me hit memory limits fast). Then I wanted to create a separate Spark script that would be triggered periodically (for example nightly, aggregating the previous day’s data) and write daily results in Parquet format, so if a statistics request came in, a new Spark job would read the data from Parquet, which would be faster than reading the current files, yes. Even considering the fact that Spark outputs Parquet as multiple files as well.
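
        And a hypothetical sketch of that nightly variant: a periodic Spark job that reads the previous day’s small Firehose objects and rewrites them as one Parquet file for that day (bucket names and prefix layout are assumptions; Firehose’s default S3 prefix is date-based, YYYY/MM/DD/HH/):

            import java.time.LocalDate
            import java.time.format.DateTimeFormatter
            import org.apache.spark.sql.SparkSession

            object NightlyCompaction {
              def main(args: Array[String]): Unit = {
                val spark = SparkSession.builder().appName("nightly-compaction").getOrCreate()

                // Compact yesterday's objects into a single Parquet file for that day.
                val day = LocalDate.now().minusDays(1)
                val prefix = day.format(DateTimeFormatter.ofPattern("yyyy/MM/dd"))

                val input = s"s3://my-firehose-bucket/$prefix/*/*"    // assumed bucket, hourly subprefixes
                val output = s"s3://my-aggregated-bucket/daily/$day/" // assumed bucket

                spark.read.json(input)
                  .coalesce(1) // one output file per day instead of ~96 or ~1440 small ones
                  .write
                  .mode("overwrite")
                  .parquet(output)

                spark.stop()
              }
            }

        Here coalesce(1) trades write parallelism for a single object per day, which matches the one-file-per-day goal described above.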

        In that case, would there be a real difference between doing this Spark aggregation to Parquet periodically (once a day) and triggering it every time a new object is added to S3? If objects arrive every minute, a new file could come in before the previous aggregation has finished.

        Well, that came out a bit long. I am sorry for that wall of text, but I couldn’t figure out everything you said in your SO answer and I wanted to clarify some things :)

        Best regards,
        Veiko
