nickmonad 1 day ago [-]
> So you're stuck debugging a system you don't control, through screenshots and copy-pasted logs on a Zoom call.
This is very real.
I work with a deployment that operates in this fashion. Although unfortunately, we can't maintain _any_ connection back to our servers. Pull or push, doesn't matter.
The goal right now is to build out tooling to export logs and telemetry data from an environment, such that a customer could trigger that export on our request, or (ideally) as part of the support ticketing process. Then our engineers can analyze async. This can be a ton of data though, so we're trying to figure out what to compress and how. We also have the challenge of figuring out how to scrub logs of any potentially sensitive information. Even IDs, file names, etc that only matter to customers.
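To make the scrub-and-compress step concrete, here is a minimal sketch of the kind of export pipeline described above. The field names and regex patterns are invented placeholders for illustration, not the deployment's actual log format:

```python
import gzip
import json
import re

# Hypothetical scrub rules: drop fields that only matter to the customer,
# and mask anything that looks like a file path or UUID in free-text messages.
DROP_FIELDS = {"customer_id", "file_name", "user_email"}  # assumed names
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)
PATH_RE = re.compile(r"(/[\w.-]+)+")

def scrub(record: dict) -> dict:
    """Remove customer-only fields and mask sensitive tokens in messages."""
    clean = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    if "message" in clean:
        msg = UUID_RE.sub("<uuid>", str(clean["message"]))
        clean["message"] = PATH_RE.sub("<path>", msg)
    return clean

def export(records, out_path):
    # gzip of newline-delimited JSON: simple, streamable, decent ratio
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(scrub(rec)) + "\n")
```

Field-level allowlists (rather than regex masking alone) are usually safer for the "IDs and file names that only matter to customers" case, since you can never enumerate every sensitive pattern up front.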
alongub 1 day ago [-]
> Although unfortunately, we can't maintain _any_ connection back to our servers. Pull or push, doesn't matter.
We're working on something for this! Stay tuned.
nodesocket 1 day ago [-]
I also used to work with on-premise installs of Kubernetes, and their “security” postures prevented any inbound access. It was a painful process of requesting access, getting on a Zoom call, and then controlling their screen via a Windows client and PuTTY. It was beyond frustrating. I tried to pitch a tool like Twingate, which doesn't open any inbound ports and can be locked down very tight using SSO, 2FA, access control rules, and IP limiting, but to no avail. They were stuck in their Windows-based IT mentality.
alongub 1 day ago [-]
At least they didn't ask you to TeamViewer into a Windows Server 2012 box and open Event Viewer..
stronglikedan 1 day ago [-]
That would be my preference compared to the situation you're replying to. Event Viewer is powerful if one takes some time to learn it.
alongub 1 day ago [-]
Fair point
lelanthran 14 hours ago [-]
For most enterprises, there are too many jobs on the line to replace Windows.
The people who know where to click and which dialog will pop up and when to click next are never going to agree to replace their non-automatable windows servers with fully automatable linux servers.
I mean, we're talking about a demographic that can't use ssh, has never been on a platform with a system package manager, and has little to no ability to version system changes.
They do all that manually.
jcgrillo 1 day ago [-]
> This can be a ton of data though, so we're trying to figure out what to compress and how. We also have the challenge of figuring out how to scrub logs of any potentially sensitive information.
This is fundamentally a data modeling problem. Currently computer telemetry data are just little bags of utf-8 bytes, or at best something like list<map<bytes, bytes>>. IMO this needs to change from the ground up. Logging libraries should emit structured data, conforming to a user supplied schema. Not some open-ended schema that tries to be everything to everyone. Then it's easy to solve both problems--each field is a typed column which can be compressed optimally, and marking a field as "safe" is something encoded in its type. So upon export, only the safe fields make it off the box, or out of the VPC, or whatever--note you can have a richer ACL structure than just "safe yes/no".
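As a toy illustration of the typed-column idea (the schema, field names, and `safe` marker here are all invented for the example, not a real logging library's API):

```python
from dataclasses import dataclass, field, fields

SAFE = "safe"  # metadata key marking fields allowed to leave the box

@dataclass
class RequestLog:
    # A user-supplied schema: each field is typed, and export safety
    # is part of the field's declaration rather than a regex afterthought.
    status: int = field(metadata={SAFE: True})
    latency_ms: float = field(metadata={SAFE: True})
    path: str = field(metadata={SAFE: False})       # may embed customer data
    tenant_id: str = field(metadata={SAFE: False})

def export_row(rec) -> dict:
    """Only fields explicitly marked safe make it off the box."""
    return {f.name: getattr(rec, f.name)
            for f in fields(rec) if f.metadata.get(SAFE)}
```

Because each field is a typed column, a real implementation could also pick a per-column codec (delta encoding for counters, dictionary encoding for enums, etc.) instead of gzipping opaque text, which is where the compression win comes from.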
I applaud the industry for trying so hard for so long to make everything backwards compatible with the unstructured bytes base case, but I'm not sure that's ever really been the right north star.
quesera 1 day ago [-]
Grand solutions require broad coordination, and they often devolve back into a modified-but-equivalent version of the previous problem. :(
Stream-of-bytes is a classically difficult model to escape. Many have tried.
jcgrillo 1 day ago [-]
Yeah. There are good reasons things are bad. But there's also a foolish consistency. Like, you can just do things! If you decide monitoring is important you can decide not to outsource it. Most everyone doesn't, though. Probably because they don't think it's very important, and the existing tools get it done well enough, and it's the muscle memory of the subjectively familiar (if objectively fantastically overpriced).
quesera 1 day ago [-]
Well, even in the early days of infrastructure growth, when designing bespoke monitoring systems and protocols would have been relatively low-cost, it was still nowhere near the highest-ROI way to spend your tech team's time and energy.
And to do it right (i.e. low risk of having it blow up with negative effects on the larger business goals), you need someone fairly experienced or maybe even specialized in that area. If you have that person, they are on the team because of their other skills, which you need more urgently.
SaaS, COTS, and open source monitoring tools have to cater to the existing customers. The sales pitch is "easy to integrate". So even they are not incentivized to build something new.
It boils down to the fact that stream-of-bytes is extremely well-understood, and almost always good enough. Infinitely flexible, low-ceremony, no patents, and comes preinstalled on everything (emitters and consumers). It's like HTTP in that way.
And the evolution is similar too. It'll always be stream-of-bytes, but you can emit in JSON or protobuf etc, if it's worth the cognitive overhead to do so. All the hyperscalers do this, even when the original emitter (web servers, etc) is just blindly spewing atrocious CLF/quirky-SSV text.
jcgrillo 22 hours ago [-]
> It'll always be stream-of-bytes, but you can emit in JSON or protobuf etc, if it's worth the cognitive overhead to do so.
This is the crux of it. That's great until you encounter a need for a schema, and then it's "schema-on-read" or some similar abomination. And the need might not manifest until you're pushing like 1TB/day or more of telemetry data with hundreds or thousands of engineers working on some >1MLoC monstrosity. Hard to dig out of that hole.
The situation is tragically optimal--we've achieved some kind of multiobjective local maximum on a rock in the sewer at the bottom of a picturesque alpine valley and declared victory. We should do better.
Or maybe I'm overly optimistic.
quesera 19 hours ago [-]
> The situation is tragically optimal--we've achieved some kind of multiobjective local maximum on a rock in the sewer at the bottom of a picturesque alpine valley and declared victory. We should do better.
But it's a very comfortable rock. Pointy in all the right places.
jcgrillo 19 hours ago [-]
til it ain't
gsgreen 1 day ago [-]
Even when you do control the environment, infra isn’t as stable as people think.
Same VPS, same config, but under sustained load you’ll see latency creep or throughput drift depending on the host / routing / neighbors.
Short tests almost never show it — it only shows up after a few minutes.
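One way to surface that kind of slow drift, sketched with a generic `op` callable standing in for the real request: run it under sustained load and compare per-window latency medians, rather than trusting a single short burst.

```python
import statistics
import time

def measure_drift(op, duration_s=300.0, window_s=30.0):
    """Run op() in a sustained loop and collect per-window median latency,
    so creep that a short test would miss becomes visible over time."""
    windows, current = [], []
    deadline = time.monotonic() + duration_s
    window_end = time.monotonic() + window_s
    while time.monotonic() < deadline:
        t0 = time.monotonic()
        op()
        current.append(time.monotonic() - t0)
        if time.monotonic() >= window_end:
            windows.append(statistics.median(current))
            current = []
            window_end = time.monotonic() + window_s
    return windows  # compare early vs. late windows for creep
```

Comparing the first and last few window medians (rather than a single aggregate) is what distinguishes drift from ordinary variance.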
alongub 1 day ago [-]
Right, and that's when you do control the environment. Now imagine debugging that when it's your customer's infra, you have no access, and you're relying on them to copy-paste logs on a Zoom call.
msteffen 1 day ago [-]
IIUC this kind of thing is usually called “managed deployment.” Minio used to have a slick implementation of this, and I think databricks does as well. Usually it’s less “execute arbitrary commands on customer hosts,” and more “send metrics and logs to shared repository and send RPCs to customer deployment”
alongub 1 day ago [-]
It's heavily inspired by Databricks' deployment model. And you're right that it's not "execute arbitrary commands". Commands are predefined functions in the deployed code that the developer defines upfront and customers can review.
The metrics/logs part is also core to Alien... telemetry flows back to the vendor's control plane so you actually have visibility into what's running.
pruthviraja 1 day ago [-]
Interesting approach. The managed self-hosting gap is real... we have run into this exact pain point with Kubernetes-based deployments where customers modify their cluster configs and things break silently. If I may ask, how does Alien handle rollback if an update fails in a customer environment? Is there any plan for on-prem/bare-metal support beyond the big three clouds?
alongub 1 day ago [-]
Alien is basically a huge state machine where every API call that mutates the environment is a discrete step, and the full state is durably persisted after each one.
If something fails mid-update, it resumes from exactly where it stopped. You can also point a deployment to a previous release and it walks back. This catches and recovers from issues that something like Terraform would just leave in a broken state.
For on-prem: we're working on Kubernetes as a deployment target (e.g. bare metal OpenShift)
rendaw 16 hours ago [-]
How is this different from Terraform? Generally if something fails during a TF apply it saves the state of all the stuff that worked and just retries the thing that failed when you next run it. And reverting your TF stack and doing apply again should walk changes back.
There are specific things where that's not possible, and there are bugs, but it doesn't seem like what you said unless you meant that you just support a limited subset of resources that are known to be robust to reverts? But that's a fairly different claim.
alongub 16 hours ago [-]
The main difference is granularity. Terraform runs a plan and applies it as a batch. If something fails, you re-run apply and it retries from the last saved state... but that state is per-resource, not per-API-call.
Alien tracks state at the individual API call level. A single resource creation might involve 5-10 API calls (create IAM role -> attach policy -> create function -> configure triggers -> set up DNS...). If it fails at step 7, it resumes from step 7. Terraform would retry the entire resource.
The other difference is that Alien runs continuously, not as a one-shot apply. It's a long-running control plane that watches the environment, detects drift, and reconciles. Terraform assumes you run it, it converges, and then nothing changes until you run it again.
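The per-API-call resumption described above can be sketched generically. To be clear, this is not Alien's actual implementation; the step names and file-based persistence are invented for illustration:

```python
import json
import os

def run_steps(steps, state_path):
    """Execute named steps in order, persisting progress after each one,
    so a rerun after a mid-update failure resumes at the failed step
    instead of replaying the whole resource."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    for name, action in steps:
        if name in done:
            continue  # already applied in a previous run
        action()  # e.g. one cloud API call
        done.append(name)
        with open(state_path, "w") as f:
            json.dump(done, f)  # durably record progress per API call
```

If step 2 of ["create_role", "attach_policy", "create_function"] throws, a second invocation skips "create_role" and retries from "attach_policy". The long-running reconcile loop would be this same idea wrapped in a watch/diff cycle.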
pruthviraja 21 hours ago [-]
I think the durable state machine approach is smart... that resume-from-where-it-stopped behavior is a big deal during incident response, when you really don't want to rerun an entire deployment just because one step failed. K8s as a deployment target would be huge, especially for the on-prem enterprise crowd. Will definitely keep an eye on that
alongub 17 hours ago [-]
Thanks so much! If you have any other ideas, I'd really appreciate it if you could shoot them my way (alon AT alien dot dev)
pruthviraja 2 hours ago [-]
For sure.. Thank you
huksley 1 day ago [-]
Is it for managing my software deployed in the customer's cloud environment? Would you support simpler deployment targets, like on premises VMs etc?
At DollarDeploy we're developing a platform to deploy apps to VMs with managed services provided, kind of like Vercel for your own servers. It would be interesting to try Alien for enterprise customers.
alongub 1 day ago [-]
> Would you support simpler deployment targets, like on premises VMs etc?
Same, I think there are a few folks who are starting to see the feasibility and desirability of hosting their own solutions. I have been working on an idea to solve this, called minima host[0].
It is intended to be simple:
- with the power of a mac mini, you can host (almost) anything
- pay for the mini, it is your machine to do with as you please (we will host it for you)
- if you decide you no longer need hosting, we will mail you back the machine that rightfully belongs to you
if anyone is interested in becoming a partner, shoot me a message, felipe@ind3x.games
It's not RCE. The commands are predefined RPCs written into the deployed code. Customers can review and approve them. Trust between the vendor and the customer is still required and Alien doesn't make it unnecessary.
cassianoleal 14 hours ago [-]
It may not be arbitrary code but it's still remote code execution.
The service provider has direct access to my infrastructure. It's one supply chain attack, one vulnerability, one missed code review away from data exfiltration or remote takeover.
armagnac2 4 hours ago [-]
what better alternative do you have?
Either you go full SaaS, which means you rely 100% on the vendor, or you work like it's 20 years ago with fully on-prem. BYOC is the right balance IMO, but it requires proper infra and implementation.
tanki 1 day ago [-]
Such a big pain. I've experienced these issues first-hand in my last two startups, and it took so much time and effort
Super cool product, I’ve gotta try it
dvirsegev 21 hours ago [-]
I’ve been waiting for a solution like this for too long, can’t wait to try it!
munksbeer 1 day ago [-]
I very seldom, if ever, see a "Show HN" title with a suffix of "written in Java" or "written in python" or "written in Go".
"Written in Rust" seems to be a very popular thing to add.
My assumption is that people know it will get the thread more visibility?
irickt 4 hours ago [-]
The signal works both ways. "Written in Java" would save a lot of clicks. So the Java author may omit the label for the same reason. /s
antonvs 1 day ago [-]
I worked for a few years on an on-premise deployment of a system that was otherwise SaaS. Many enterprise customers simply won’t allow something like this - particularly big financials, aviation, healthcare etc.
Realistically, the game ends up being - see what you can get away with until someone notices. Given that, you might want to rename the product to something more boring than “Alien”.
alongub 1 day ago [-]
In practice, unmanaged self-hosting is often less secure, because you end up with outdated versions, unpatched vulnerabilities, and no one responsible for keeping things healthy.
More and more enterprise CISOs are starting to understand this.
The model here is closer to what companies like Databricks already do inside highly regulated environments. It's not new... it's just becoming more structured and accessible to smaller vendors.
OlivOnTech 1 day ago [-]
I don't agree; I see supply chain attacks as a bigger risk than outdated systems exposed only on the LAN.
alongub 1 day ago [-]
Both are real risks. But supply chain attacks exist whether you self-host or not... you're still running the vendor's code either way. The question is whether you also want that code to stay up to date and properly managed, or drift silently.
nickmonad 1 day ago [-]
I agree that keeping things up to date is a good practice, and it would be nice if enterprise CISOs would get on board with that. One challenge we've seen is that other aspects of the business don't want things to be updated automatically, in the same way a fully-managed SaaS would be. This is especially true if the product sits in a revenue generation stream. We deal with "customer XYZ is going to update to version 23 next Tuesday at 6pm eastern" all the time.
alongub 1 day ago [-]
This is true even with fully-managed SaaS though. There are always users who don't want the new UI, the changed workflow, the moved button. But the update mechanism isn't really the problem IMO; feature flags and gradual rollouts solve this much better than version pinning.
nickmonad 1 day ago [-]
Sure. I'm just saying in the context where fully-managed SaaS was already decided not to be an option, and a customer is deploying vendor code in their environments, the update mechanism can in fact be a problem. It's not just poor CISO management.
mrhottakes 1 day ago [-]
agreed, this architecture is a non-starter for many enterprise orgs
https://github.com/alienplatform/alien/blob/main/crates/alie... :)
A different take: https://www.cloudron.io/
[0] https://www.minimahost.com/