Community.haskell.org is a server in our ecosystem that comparatively few know about these days. It actually was, to my knowledge, a key part of how the whole haskell.org community infrastructure got set up way back when. The sparse homepage still even says: "This server is run by a mysterious group of Haskell hackers who do not wish to be known as a Cabal, and is funded from money earned by haskell.org mentors in the Google Summer-of-Code programme." At a certain point after this server was created, it ceased to be run by a "mysterious group of Haskell hackers" and instead became managed officially by the haskell.org Committee that we know today. You can see the original announcement email in the archives.
The community server, first set up in 2007 played a key role back before the current set of cloud-based services we know today was around. It provided a shared host which could provide many of the services a software project needs -- VCS hosting, public webspace for documentation, issue trackers, mailing lists, and soforth.
Today, the server is somewhat of a relic of another time. People prefer to host projects in places like github, bitbucket, or darcs hub. Issue trackers likewise tend to be associated with these hosts, and there are other free, hosted issue trackers around as well. When folks want a mailing list, they tend to reach for google groups.
Meanwhile, managing a big box full of shell account has become a much more difficult, riskier proposition. Every shell account is a security vulnerability waiting to happen, and there are more adversarial "scriptkiddie" hackers than ever looking to claim new boxes to spam and otherwise operate from.
Managing a mailman installation is likewise more difficult. There are more spammers out there, with better methods, and unmaintained lists quickly can turn into ghost towns filled with pages of linkspam and nothing but. The same sad fate falls on unmaintained tracs.
As a whole, the internet is a more adversarial world for small, self-hosted services, especially those whose domain names have some "google juice". We think it would be good to, to the extent possible, get out of the business of doing this sort of hosting. And indeed, very few individuals tend to request accounts, since there are now so many nicer, better ways of getting the benefits that community.haskell.org once was rare in providing.
So what next? Well, we want to "end of life" most of community.haskell.org, but in as painless a way as possible. This means finding what few tracs, if any, are still active, and helping their owners migrate. Similarly for mailing lists. Of course we will find a way to continue to host their archives for historical purposes.
Similarly, we will attempt to keep source repositories accessible for historical purposes, but would very much like to encourage owners to move to more well supported code hosting. One purpose that, until recently, was hard to serve elsewhere was in hosting of private darcs repositories with shared access -- such as academics might use to collaborate on a work in project. However, that capability is now also provided on http://hub.darcs.net. At this point, we can't think of anything in this regard that is not better provided elsewhere -- but if you can, please let us know.
On webspace, it may be the case that a little more leniency is in order. For one, it is possible to provide restricted accounts that are able to control web-accessible files but have no other rights. For another, while many open source projects now host documentation through github pages or similar, and there are likewise many services for personal home pages, nonetheless it seems a nice thing to allow projects to host their resources on a system that is not under the control of a for-profit third party that, ultimately is responsible to its bottom line and not its users.
But all this is open for discussion! Community.haskell.org was put together to serve the open source community of Haskell developers, and its direction needs to be determined based on feedback regarding current needs. What do you think? What would you like to see continued to be provided? What do you feel is less important? Are there other good hosted services that should be mentioned as alternatives?
And, of course, are you interested in rolling up your sleeves to help with any of the changes discussed? This could mean simply helping out with sorting out the mailman and trac situation, inventorying the active elements and collaborating with their owners. Or, it could mean a more sustained technical involvement. Whatever you have to offer, we will likely have a use for it. As always, you can email email@example.com or hop on the #haskell-infrastructure freenode channel to get involved directly.
Much of the intermediate time was spent on cleanup of the introductory text, as well as adding features and tweaking designs. There was also a very length process of setting everything up to ensure that the "try it" feature could be deployed and well supported. Finally, we had to do all the work to ensure that both the wiki remained hosted well somewhere else with proper rewrite rules, and also that the non-wiki content hosted under haskell.org (things like /ghc, /platform, /hoogle, etc) continued to work. Some of the more publicly visible elements of this are the move of mailinglist hosting to http://mail.haskell.org and of the wiki content to http://wiki.haskell.org.
There were also a lot of things we missed, and which those threads pointed out -- in particular we didn't pay enough attention to the content of various pages. The "documentation" and "community" section needed lots of cleanup and expansion. Furthermore, the "downloads" page didn't point to the platform. Whoops! (It is to be noted that recommendation of the platform or not is an issue of some discussion now, for example on this active reddit thread. )
We think that we have fixed most of the issues raised. But there are still issues! The code sample for primes doesn't work in the repl. We have an active github ticket discussing the best way to fix this -- either change the sample, change the repl, or both. This is one of a number of issues under discussion on the github tracker.
We still want feedback, we still want patches, and there's still plenty to be made more awesome. For example: should we have some sort of integrated search? How? Where? The github infrastructure makes taking patches and tracking these issues easy. The repo now has 27 contributors, and we're look forward to more.
One balance that has been tricky to strike has been in making the site maximally useful for new users just looking to learn and explore Haskell, while also providing access to the wide range of resources and entry points that the old wiki did. One advantage that we have now is that the wiki is still around, and is still great. Freed from needing to also serve as a default haskell landing page, the wiki can hopefully grow to be an even better resource. It needs volunteers to chip in and curate the content. And ideally it needs a new front page that highlights the things that you _won't_ find on the main homepage, instead of the things you now can. One place we could use more help is in wiki admins, if anyone wants to volunteer. We need people to curate the news and events sections, to fight spam, to manage plugins and investigate adding new features, to update and clean up the stylesheets, and to manage new user accounts. If you're interested, please get in touch at firstname.lastname@example.org.
All the resources we have today are the result of our open-source culture, that believes in equal parts in education, innovation, and sharing. They stem from people who enjoy Haskell wanting to contribute back to others -- wanting to make more accessible the knowledge and tools they've struggled to master, and wanting to help us make even better and more powerful tools.
There's always more infrastructure work to be done, and there are always more ways to get involved. In a forthcoming blog, I'll write further about some of the other pressing issues we hope to tackle, and where interested parties can chip in.
This past week, the primary Haskell.org server hosting the wiki and mailing systems went down due to RAID failures in its underlying hardware. Thanks to some (real) luck and hard work, we've managed to recover and move off the server to a new system, avoided some faulty hardware, and hopefully got some improved reliability.
Before we started using Rackspace, eveyrthing existed on a single machine hosted by Hetzner, in Germany. This machine was named rock and ran several VMs with IPs allocated to them, which ran Hackage, Mediawiki (AKA www) and ghc.haskell.org on top of KVM.
This server had a RAID1 setup between two drives. These drives were partitioned to have a primary 1.7TB partition for most of the data that was striped, and another small partition also striped.
We had degredation on both partitions and essentially lost one drive completely. This is why the system had been so slow the past week and grinding to a halt so often.
We began to investigate this when the drives became unable to fairly service IO almost at all. We found something really really bad: we hadn't had one of the drives in nearly two weeks!
We had neglected to install SMART and RAID monitoring in our Nagios setup. An enormous and nearly disastrous blunder. And we didn't even look at our SFTP access for backups that Hetzner provided, but it was close to out of space last we checked. But we seemed to be OK in the read-only workload.
Overall, it was some amateur mistakes that almost cost us. We'd been so focused on moving things out, we didn't even take the time to check the server when we moved Hackage a few weeks prior and making sure the existing things kept working.
We'd planned to do this move earlier, but it obviously couldn't wait. So Davean spent most of his Tuesday migrating all the services over to Rackspace on a new VM, and we migrated the data to our shared MySQL server, and got the mail relay running again.
The whole process took close to 12 hours or so I'd say, but overall went quite smoothly without any kind of read errors or problems on the remaining drive, and restored service.
In the midst of anticipating this move, I migrated a lot of download data - specifically GHC and the Haskell Platform - to a new server, https://downloads.haskell.org. It's powered by a brand new CDN via Fastly, and provides the Platform, and GHC downloads and documentation. Don't worry: redirects for the main server are still in place.
We've also finally fixed a few bugs stopping us from deploying the new main website. We've deployed https://try.haskell.org now as an official instance, as well as fixed https://new-www.haskell.org to use it.
We've also moved most the MediaWiki data to our consolidated MariaDB server. However, because we were so hastily moving data, we didn't put the new wiki in the same DC as the database! As a result, we have somewhat of a performance loss. In the mean time, we've deployed Memcached to try and help offset this a bit. We'll be moving things again soon, but it will hopefully be faster and much less problematic.
We've also taken the time to archive a lot of our data from Rock onto DreamHost, who have joined us and given us free S3-compatible object storage! We'll be moving a lot of data in there soon.
We've got several improvements we're going to deploy in the near future:
And some other stuff I'll keep secret.
Overall, the past few weeks have seen several improvements but a lot of problems, and some horrible mistakes that almost cost us weeks of data if we hadn't been careful. Most of our code, repositories, configurations, and critical data was mirrored at the time - but we still got struck where we were vulnerable, on old hardware we had issues with before.
For that, I take responsibility on behalf of the administration team (as the de-facto lead, it seems) for the outage this past week, which cost us nearly a day of email and main site availability.
The past week hasn't been great, but here's to hoping the next one will be better. Upwards and onwards.
What you're reading is a blog post. Where is it from? It's from https://blog.haskell.org. What's it doing? It's cataloging the thoughts of the people who run Haskell.org.
That's right. This is our new adventure in communicating with you. We wanted some place to put more long-form posts, invite guests, and generally keep people up to date about general improvements (and faults) to our infrastructure, now and in the future. Twitter and a short-form status site aren't so great, and a non-collective mind of people posting scattered things on various sites/lists isn't super helpful for cataloging things.
So for an introduction post, we've really got a lot to talk about...
Haskell.org has had some rough times lately.
About a month and a half ago, we had an extended period of outage, roughly around the weekend of ICFP 2014. This was due to a really large amount of I/O getting backed up on our host machine, rock. rock is a single-tenant, bare-metal machine from Hetzner that we used to host several VMs that comprise the old server set; including the main website, the GHC Trac and git repositories, and Hackage. We alleviated a lot of the load by turning off the hackage server, and migrating one of the VMs to a new hosting provider.
Then, about a week and a half ago, we had another hackage outage that was a result of more meager concerns: disk space usage. Much to my chagrin, this was due in part to an absence of log rotation over the past year, which resulted in a hefty 15GB of text sitting around (in a single file, no less). Oops.
This caused a small bump on the road, which was that the hackage server had a slight error while committing some transactions in the database when it ran out of disk. We recovered from this (thanks to @duncan for the analysis), and restarted it. (We also had point-in-time backups, but in this case it was easier to fix than rollback the whole database).
But we've had several other availability issues beforehand too, including faulty RAM and inconsistent performance. So we're setting out to fix it. And in the process we figured, hey, they'd probably like to hear us babble about a lot of other stuff, too, because why not?
OK, so enough sad news about what happened. Now you're wondering what's going to happen. Most of these happening-things will be good, I hope.
There are a bunch of new things we've done over the past year or so for Haskell.org, so it's best to summarize them a bit. These aren't in any particular order; most of the things written here are pretty new and some are a bit older since the servers have started churning a bit. But I imagine many things will be new to y'all.
And it's a fancy one at that (powered by Phabricator). Like I said, we'll be posting news updates here that we think are applicable for the Haskell.org community at large - but most of the content will focus on the administrative side.
As I mentioned earlier this year pending the GHC 7.8 release, Rackspace has graciously donated resources towards Haskell.org for GHC, particularly for buildbots. We had at that time begun using Rackspace resources for hosting Haskell.org resources. Over the past year, we've done so more and more, to the point where we've decided to move all of Haskell.org. It became clear we could offer a lot higher reliability and greatly improved services for users, using these resources.
Jesse Noller was my contact point at Rackspace, and has set up Haskell.org for its 2nd year running with free Rackspace-powered machines, storage, and services. That's right: free (to a point, the USD value of which I won't disclose here). With this, we can provide more redundant services both technically and geographically, we can offer better performance, better features and management, etc. And we have their awesome Fanatical Support.
So far, things have been going pretty well. We've migrated several machines to Rackspace, including:
We're still moving more servers, including:
Many thanks to Rackspace. We owe them greatly.
We've done several overhauls of the way Haskell.org is managed, including security, our underlying service organization, and more.
While we're on the subject, here's an example of what the new Hackage Server will be sporting:
So, Hackage should hopefully be OK for a long time. And, the doc builder is now working again, and should hopefully stay that way too.
Like many other sites, Haskell.org is big, complicated, intimidating, and there are occasionally points where you find a Grue, and it eats you mercilessly.
As a result, automation is an important part of our setup, since it means if one of us is hit by a bus, people can conceivably still understand, maintain and continue to improve Haskell.org in the future. We don't want knowledge of the servers locked up in anyone's head.
In The Past, Long ago in a Galaxy unsurprisingly similar to this one at this very moment, Haskell.org did not really have any automation. At all, not even to create users. Some of Haskell.org still does not have automation. And even still, in fact, some parts of it are still a mystery to all, waiting to be discovered. That's obviously not a good thing.
Today, Haskell.org has two projects dedicated to automation purposes. These are:
We eventually hope to phase out Ansible in favor of Auron. While Auron is still very preliminary, several services have been ported over, and the setup does work on existing providers. Auron also is much more philosophically aligned with our desires for automation, including reproducibility, binary determinism, security features, and more.
In our quest to automate our tiny part of the internet, we've begun naturally writing a bit of code. What's the best thing to do with code? Open source it!
The new haskell-infra organization on GitHub hosts our code, including:
Most of our repositories are hosted on GitHub, and we use our Phabricator for code review and changes between ourselves. (We still accept GitHub pull requests though!) So it's pretty easy to contribute in whatever way you want.
We're very recently begun using CloudFlare for Haskell.org for DNS management, DDoS mitigation, and analytics. After a bit of deliberation, we decided that after moving off Hetzner we'd think about a 3rd party provider, as opposed to running our own servers.
We chose CloudFlare mostly because aside from a nice DNS management interface, and great features like AnyCast, we also get analytics and security features, including immediate SSL delivery. And, of course, we get a nice CDN on top for all HTTP content. The primary benefits from CloudFlare are the security and caching features (in that order, IMO). The DNS interface is still particularly useful however; the nameservers should be redundant, and CloudFlare acts more like a reverse proxy as changes are quick and instant.
But unfortunately while CloudFlare is great, it's only a web content proxy. That means certain endpoints which need things like SSH access can not (yet) be reliably proxied, which is one of the major downfalls. As a result, not all of Haskell.org will be magically DDoS/spam resistant, but a much bigger amount of it will be. But the bigger problem is: we have a lot of non-web content!
In particular, none of our Hackage server downloads for example can proxied: Hackage, like most package repositories, merely uses HTTP as a transport layer for packages. In theory you could use a binary protocol, but HTTP has a number of advantages (like firewalls being nice to it). Using a service like CloudFlare for such content is - at the least - a complete violation of the spirit of their service, and just a step beyond that a total violation of their ToS (Section 10). But hackage pushes a few TB a month in traffic - so we have to pay line-rate for that, by the bits. And also, Hackage can't have data usefully mirrored to CDN edges - all traffic has to hop through to the Rackspace DCs, meaning users suffer at the hands of latency and slower downloads.
But that's where Fastly came to the rescue. Fastly also recently stepped up to provide Haskell.org with an Open Source Software discount - meaning we get their awesome CDN for free, for custom services! Hooray!
Since Fastly is a dedicated CDN service, you can realistically proxy whatever you want with it, including our package downloads. With the help of a new friend of ours (@davean), we'll be moving Fastly in front of Hackage soon. Hopefully this just means your downloads and responsiveness will get faster, and we'll use less bandwidth. Everyone wins.
Finally, we're rolling out CloudFlare gradually to new servers to test them and make sure they're ready. In particular, we hope to not disturb any automation as a result of the switch (particularly to new SSL certificates), and also, we want to make sure we don't unfairly impact other people, such as Tor users (Tor/CloudFlare have a contentious relationship - lots of nasty traffic comes from Tor endpoints, but so does a ton of legitimate traffic). Let us know if anything goes wrong.
Server monitoring is a crucial part of managing a set of servers, and Haskell.org unfortunately was quite bad at it before. But not anymore! We've done a lot to try and increase things. Before my time, as far as I know, we pretty much only had some lame mrtg graphs of server metrics. But we really needed something more than that, because it's impossible to have modern infrastructure on that alone.
Enter DataDog. I played with their product last year, and I casually approached them and asked if they would provide an account for Haskell.org - and they did!
DD provides real-time analytics for servers, while providing a lot of custom integrations with services like MySQL, nginx, etc. We can monitor load, networking, and correlate this with things like database or webserver connection count. Events occur from all over Haskell.org On top of that, DD serves as a real-time dashboard for us to organize and comment on events as they happen.
But metrics aren't all we need. There are two real things we need: metrics (point-in-time data), and resource monitoring (logging, daemon watchdogs, resource checks, etc etc).
This is where Nagios comes in - we have it running and monitoring all our servers for daemons, heatlh checks, endpoint checks for connectivity, and more. Datadog helpfully plugins into Nagios, and reports events (including errors), as well as sending us weekly summaries of Nagios reports. This means we can helpfully use the Datadog dashboard as a consolidated piece of infrastructure for metrics and events.
As a result: Haskell.org is being monitored much more closely here on out we hope.
We've (very recently) also begun rolling out another part of the equation: log management. Log management is essential to tracking down big issues over time, and in the past several years, ElasticSearch has become incredibly popular. We have a new ElasticSearch instance, running along with Logstash, which several of our servers now report to (via the logstash-forwarder service, which is lightweight even on smaller servers). Kibana sits in front on a separate server for query management so we can watch the systems live.
Furthermore, our ElasticSearch deployment is, like the rest of our infrastructure, 100% encrypted - Kibana proxies backend ElasticSearch queries through HTTPS and over spiped. Servers dump messages into LogStash over SSL. I would have liked to use spiped for the LogStash connection as well, but SSL is unfortunately mandatory at this time (perhaps for the best).
We're slowly rolling out logstash-forwarder over our new machines, and tweaking our LogStash filters so they can get juicy information. Hopefully our log index will become a core tool in the future.
As I'm sure some of you might be aware, we now have a fancy new site, https://status.haskell.org, that we'll be using to post updates about the infrastructure, maintenance windows, and expected (or unexpected!) downtimes. And again, someone came to help us - https://status.io gave us this for free!
Rackspace also fully supports their backup agents which provide compressed, deduplicated backups for your servers. Our previous situation on Hetzner was a lot more limited in terms of storage and cost. Our backups are stored privately on Cloud Files - the same infrastructure that hosts our static content.
Of course, backup on Rackspace is only one level of redundancy. That's why we're thinking about trying to roll out Tarsnap soon too. But either way, our setup is far more reliable and robust and a lot of us are sleeping easier (our previous backups were space hungry and becoming difficult to maintain by hand.)
GHC has for a long time had an open infrastructure request: the ability to build patches users submit, and even patches we write, in order to ensure they do not cause machines to regress. Developers don't necessarily have access to every platform (cross compilers, Windows, some obscurely old Linux machine), so having infrastructure here is crucial.
We also needed more stringent code review. I (Austin) review most of the patches, but ideally we want more people reviewing lots of patches, submitting patches, and testing patches. And we really need ways to test all that - I can't be the bottleneck to test a foreign patch on every machine.
At the same time, we've also had a nightly build infrastructure, but our build infrastructure as often hobbled along with custom code running it (bad for maintenance), and the bots are not directed and off to the side - so it's easy to miss build reports from them.
Enter Harbormaster, our Phabricator-powered buildbot for continuous integration and patch submissions!
Harbormaster is a part of Phabricator, and it runs builds on all incoming patches and commits to GHC. How?
This has already lead to a rather large change in development for most GHC developers, and Phabricator is building our patches regularly now - yes, even committers use it!
Harbormaster will get more powerful in the future: our build plans will lease more resources, including Windows, Mac, and different varieties of Linux machines, and it will run more general build plans for cross compilers and other things. It's solved a real problem for us, and the latest infrastructure has been relatively reliable. In fact I just get lazy and submit diffs to GHC without testing them - I let the machines do it. Viva la code review!
(See the GHC wiki for more on our Phabricator process there's a lot written there for GHC developers.)
That's right, there's now documentation about the Haskell.org infrastructure, hosted on our new official wiki. And now you can report bugs through Maniphest to us. Both of these applications are powered by Phabricator, just like our blog.
In a previous life, Haskell.org used Request Tracker (RT) to do support management. Our old RT instance is still running, but it's filled with garbage old tickets, some spam, it has its own PostgreSQL instance alone for it (quite wasteful) and generally has not seen active use in years. We've decided to phase it out soon, and instead use our Phabricator instance to manage problems, tickets, and discussions. We've already started importing and rewriting new content into our wiki and modernizing things.
Hopefully these docs will help keep people up to date about the happenings here.
But also, our Phabricator installation has become an umbrella installation for several Haskell.org projects (even the Haskell.org Committee may try to use it for book-keeping). In addition, we've been taking the time to extend and contribute to Phab where possible to improve the experience for users.
In addition to that, we've also authored several Phab extensions:
Of course, we're not done. That would be silly. Maintaining Haskell.org and providing better services to the community is a real necessity for anything to work at all (and remember: computers are the worst).
We've got a lot further to go. Some sneak peaks...
And, of course, we'd appreciate all the help we can get!
This post was long. This is the ending. You probably won't read it. But we're done now! And I think that's all the time we have for today.
This is a brand new blog for Haskell.org, yoink!