Friday, January 6, 2012

Small data and humble sequentialism

Yeah, big data and massive parallelism in the cloud are exciting!

But there are two problems with the cloud:
  1. Technical: doing everything in centralized data centers, when all of your users have multiple, hugely overprovisioned computing devices, is a bit of a joke.
  2. Political: do you actually want to have all your users' data, and take on the huge responsibilities and risks associated with that?
The cloud is really a reaction to browsers' highly limited request-response interaction style. But browsers have changed now, they're becoming little operating systems.

May I suggest taking a fresh look at small data and humble sequentialism in 2012?


Yew-wei Tan said...

Yeah, I'm still bothered by the fact that I can't even rsync two computers sitting right next to each other without some crappy method like Bluetooth, or without some "smart" central syncing device (maybe on the local network, but typically in the "Cloud").

For my startup, there was a 2-month period when I was trying to build out a bunch of P2P technologies using both proprietary (Apple's GameKit) and free methods.

On the free side, I tried the typically-proximal methods like Zeroconf. It didn't work as advertised.

I then tried a couple of implementations of UDP hole punching: tried and failed using ZeroMQ with UDP broadcast, tried and failed using the PJSIP library, and got it working only intermittently when rolling my own sockets (using libev).
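For readers unfamiliar with the technique being attempted here, the core "punch" step can be sketched in a few lines. This is a minimal, illustrative sketch only: both sockets live on localhost, so there is no real NAT to traverse. In the wild, each peer first learns the other's public (IP, port) from a rendezvous server, then both send datagrams at roughly the same time so each NAT installs an outbound mapping that admits the other side's packets.

```python
import socket

def make_peer():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))  # OS picks a free port, standing in for a NAT mapping
    s.settimeout(2.0)
    return s

a, b = make_peer(), make_peer()

# The punch: both sides fire a datagram at the other's known endpoint.
# Behind real NATs, these simultaneous outbound sends are what open the path.
a.sendto(b"punch from a", b.getsockname())
b.sendto(b"punch from b", a.getsockname())

msg_at_b, _ = b.recvfrom(1024)  # b hears a
msg_at_a, _ = a.recvfrom(1024)  # a hears b
print(msg_at_b.decode())
print(msg_at_a.decode())
a.close()
b.close()
```

The hard part, as the comment describes, is everything this sketch omits: symmetric NATs that rekey the mapping per destination, firewalls that drop unsolicited inbound UDP, and timing the simultaneous sends.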

As you've guessed, the MAJOR issue is networking. And while the IETF tells us how we should behave (and publishes useless specs like RELOAD), real life never quite plays along.

Simple tcpdumps and more complicated packet captures all lead to the same conclusion: the possible network contexts of a device are way too numerous to tackle in any meaningful form.

I really wish someone like Skype would contribute to open source, and hence to the state of P2P research, but it's bad for them financially, and politically the term "P2P" is laden with negative connotations.

A little sad, really, especially since sockets were designed with the notion of "Equal Peers" as much as "Client/Server".

We'll see, but fundamentally this is a problem of giving every component on the Internet a unique address that is guaranteed to map to that-and-only-that component.

No idea how we're going to get there, though I envision some "Linux for Networking" -- e.g. a free way to provision said unique addressing using well-known addresses in the current infrastructure, which then acts as a service that is essentially a K-V store mapping addresses to components.

Making that work is obviously a massive challenge, but well, hopefully people are thinking about it (I certainly am!), and maybe we'll see a resolution soon.

But I'd bet closer to 2020 than 2012 =P

Manuel Simoni said...

"you cannot resist an idea whose time has come"

(thanks for your comment)

autre said...

I'm providing a link that might touch on the direction of your general interests.

dmbarbour said...

Every `bit` of small data has an origin story that ultimately relates back to `big data`. It seems to me that most computation is about the relationship between the two - i.e. gathering and summarizing data to extract condensed decisions and information.

I posit that scalable data systems ultimately need to maintain a consistent relationship between those small-data views and their big-data origins. That is: only reactive systems can be scalable. Further, `interactive` views (e.g. mutable views, lenses) are necessary to propagate user influence back to the origin.
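The "interactive views" idea can be illustrated with a minimal lens: a small-data view extracted from a big-data source, paired with a `put` that propagates an edit back to the origin. The lens structure and the sample records below are my own illustrative assumptions, not a quote from the comment.

```python
class Lens:
    """A bidirectional view: get extracts a small view from a source,
    put writes an edited view back, producing an updated source."""

    def __init__(self, get, put):
        self.get = get   # source -> view
        self.put = put   # (source, new_view) -> new_source

# Lens focusing on one user's email inside a larger record store.
email_lens = Lens(
    get=lambda db: db["alice"]["email"],
    put=lambda db, v: {**db, "alice": {**db["alice"], "email": v}},
)

db = {"alice": {"email": "a@old"}, "bob": {"email": "b@x"}}
view = email_lens.get(db)          # the condensed, small-data view
db2 = email_lens.put(db, "a@new")  # a user edit flows back to the origin
```

Note that `put` returns a new source rather than mutating in place, which is what lets a reactive system keep the view and its big-data origin consistently related.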

We don't need clouds for massive parallelism and big data, of course. Consider using WebCL to achieve local parallelism in the browser. Local parallelism can help eliminate concerns such as disruption, security, and politics.