Interviews/Rob Halsell

Interview with Rob Halsell from Wikimedia, January 11, 2010
Background & Prior experience before Wikimedia, roles in Wikimedia

Always been doing computing in IT – as either network administrator or file administrator for various corporations
Worked in data centers at large corporations
Started at Wiki in December 2006
Responsible for site operations management of the primary data centers – handles the vendor relationships and purchasing in Florida facility, also does system administration and software configuration
Also does some IT tech support for the office

For site operations and data centers, what are the primary strengths of the Wikimedia systems and primary weakness?

Weakness – primary datacenter in Tampa – hurricane damage with internet issues for months after the hurricane. Need a multi-homed approach to data centers – that is the biggest issue
Strengths – Fact that we have been able to do a whole lot with a very small footprint (in one building with small budget) – But still need another data center.

What leads to Wikimedia’s success with a small footprint?

People they have are very competent and intelligent – they have done this for other large scale companies or are volunteers working at large scale companies

Site reliability – Frequent outages and room for human error – can you shed a little light on the reason for the site outages?

Don’t have a fully developed sandbox where we can test software – we once pushed a software change and the load was exponential on the backend

What would be the solution to enable testing the software in a safe environment?

Need a more developed sandbox environment – not sure what the programmers need

From the operations piece of it – Heard people say they need a sandbox – What would be involved in developing a sandbox?

Already setting up a sandbox – load balancing – making steps to do that
There is an easy short term fix that will be done within the first two months
Trying to get a development sandbox set up – doesn’t need to be on the agenda for like eight months
Think it is something that can be solved soon – there are 18 people involved in developing it – strong consensus that this is something that Wiki should do and the first step is not that difficult to put into place.
Will have the first steps of the hardware done in the next couple of months and then the long term goal will be to refine the software

We have been hearing that one of the big 5 year operations goals would be a data server that is not in Tampa

In the next year, we need one data center outside of Tampa
In the next five years, we need multiple data centers – no other top ten sites have only one data center

What happened to the server in Korea?

Yahoo donated half a rack of servers to Wiki – it was not a donation to keep – Yahoo would pay for everything – speeds up access to Wikipedia in that area of the world
There was no paperwork for this – could never find anything to prove the conditions – there had been a lot of turnover and no record of anything – the servers started to break
We turned off the servers and left them alone

Can Wiki use an outside party for the data centers?

The big two or three primary data centers need to be directly managed by Wiki or by contracted carriers, you don’t want to be trapped into someone’s bandwidth restrictions
Beyond the primary two data centers – there is no issue to take any donations

What do you have in the data centers?

Three big groups – database servers, squid servers, apache servers
- Squid servers – caching the information on the caching servers
- Web servers give the caching servers updates
- Database servers hold all the actual current information on Wikipedia, Caching servers hold the current snapshots, Web servers link to the outside world
- Yahoo servers in Korea were Caching servers
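
A rough sketch of how the three groups described above might fit together, offered only as an illustration. All class and method names here are hypothetical and are not Wikimedia's actual software; the sketch assumes the caching servers keep rendered snapshots, fall back to a web server when they have no snapshot, and have a snapshot dropped when an article changes.

```python
class DatabaseServer:
    """Holds the actual current text of every article (hypothetical model)."""

    def __init__(self):
        self.articles = {}

    def read(self, title):
        return self.articles[title]

    def write(self, title, text):
        self.articles[title] = text


class WebServer:
    """Builds a page from the current database contents (hypothetical model)."""

    def __init__(self, database):
        self.database = database

    def render(self, title):
        return f"<h1>{title}</h1><p>{self.database.read(title)}</p>"


class CacheServer:
    """Serves stored snapshots; asks a web server only when it has none."""

    def __init__(self, web_server):
        self.web_server = web_server
        self.snapshots = {}
        self.hits = 0
        self.misses = 0

    def get(self, title):
        if title in self.snapshots:
            self.hits += 1
            return self.snapshots[title]
        self.misses += 1
        page = self.web_server.render(title)
        self.snapshots[title] = page
        return page

    def purge(self, title):
        # Drop the stale snapshot so the next request fetches a fresh page.
        self.snapshots.pop(title, None)


def edit_article(database, caches, title, text):
    """Assumed edit flow: save to the database, then purge every cache."""
    database.write(title, text)
    for cache in caches:
        cache.purge(title)
```

Under these assumptions, repeat readers are served from the caching servers without touching the web or database servers, which is one way a remote set of caching servers (like the ones in Korea) could speed up access for a region.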

Capacity – what we have heard is that current capacity is sufficient in terms of handling the text based data but as more media and video streaming comes along, capacity is not sufficient. Could you comment about that?

Text in Wikipedia is still the majority of the article
Text takes up next to no room compared to image data – could go and take whatever image servers are too small and they can hold text data
Don’t know a whole lot about that – that is more development
Storage is always going to be an issue. If people want to upload video – storage space is always going to be a concern

If video is going to become a large part of Wikipedia – how much would capacity need to increase?

Don’t know scale very well – I have nothing to do with it – that’s all development

System to vote or chat…What would be the operational impact?

Messaging would add a little bit of site overhead – not like Facebook where people are always on the site – would make it so new people have an easy integration – can’t imagine it would be some staggering increase…
Need a different server added in to update messaging client – increase through Apache pool or new server brought in to handle that
Not really the best person to ask for this.

Is there anything that you can imagine that Wikimedia would choose to do that would dramatically increase the demands on capacity?

If we turned on video uploads with no limitations – that may be a nightmare but otherwise not sure of anything

One to five year strategic priorities – getting a couple of new data centers, in kind donations, etc. – is there anything else separate from the data centers that would be a top of mind priority?

Cold backup – some kind of offsite, automated, remote option to backup all of the data. Store it with a third party solution
Security precaution to prevent data loss.

Can you talk about what it would be like to reconstruct the data if the building in Florida that houses the data centers was destroyed?

Much more painful process of getting it all in place.
Also if someone internally decided to wipe everything out then we have to hope people’s own backups work because we don’t have a cold backup solution.

If someone decides to sabotage…

If someone tried to sabotage the data, they could do a lot of damage – they could do an insufferable amount of damage and it would take a lot of damage recovery
It is about having this cold backup – data is somewhere (like with Iron Mountain) – can drop off data, they come pick it up – can’t get the data from them – high security – two of the four have to be there to get data. Can call and lock someone out in 3 minutes or less

Comments on Powerpoint slide:

Site in multiple places is the actual infrastructure of the site
Site reliability vs. site infrastructure is a false distinction
Backups of site and office data (just text is incorrect – has to be everything)
Under site reliability --- Limited standardization – sandbox falls directly under that bullet point
Lack of redundancy… getting at the need for the other data center and the lack of cold backup
RELIABILITY – falls under both: 1. Need for cold backup storage. 2. Need for live second data center
Capacity – Correct to say that current media upload is low
Capacity - A significant increase in participation, or Web 2.0 features such as chat, or real-time collaboration would dramatically increase demands on servers – Not the best person to ask about it …
Opinion – No other comments…
Cold backup & multi-site backup are the two big things that I feel are important