Dark night in Tweetland
I know last week’s authentication problems frustrated a lot of people. (One of them was me.) At the time we initially limited ourselves to the usual bland ‘We know there are issues and we’re looking into it’ on Twitter: it takes time to communicate details, it provokes kibitzing, it can reveal details about our setup that I don’t necessarily want to, and if I’m doing something stupid, I don’t necessarily want people to know. But when we went into a bit more detail, people seemed to like it. So for the hell of it, here’s an abridged account of what went on over the last four days: a peek into the engine room.
1. Get reports that some users are having problems authenticating.
2. Get enough reports that I become convinced that it’s not just the usual occasional Twitter hiccups.
3. Try to reproduce the issues with several test accounts. No dice.
4. RDP into our servers in the US and see if I can reproduce from there, in case it’s my location that’s unaffected. Nope.
5. Scour the Web, api.twitter.com, the Twitter stream and so forth for references to similar problems. No luck.
6. Still can’t reproduce. Lots of people complaining now. Problem seems to be becoming more widespread.
7. Realise the functionality that normally emails error messages to me has been down for about 72 hours. Oops. Footle about with it for a quarter-hour before giving up for now (the logs still work).
8. Add some diagnostics to see what’s failing. There’s a JSON deserialization error in Tweetsharp (the Twitter API library we use) that I don’t understand.
9. Download the latest version of Tweetsharp since we’re about two versions behind. Run smoke tests and do an emergency deploy. This is a shot in the dark, but has fixed issues twice before when we hit Twitter problems. (What happened on those previous occasions was, Twitter trailed their changes, the Tweetsharp committers followed through, then Twitter made the changes once everyone had caught up…but we hadn’t caught up.) No luck.
10. Discover my wife can reproduce. Er, that is, my wife can recreate the authentication problem.
11. Get sidetracked by a hypothesis that the problem only affects users with spaces in their display name.
12. Dig into root cause using wife’s account. Fail to get anywhere. It’s failing for my test accounts too, now, though, which is… sort of reassuring.
13. Talk to another Tweetsharp-using dev with similar problems who thought he was going mad. Start a thread on the Tweetsharp site. Tweetsharp co-ordinator Jason Diller responds almost immediately.
14. Doh! moment when I realise actually one of the threads on Twitter devtalk *is* referring to this problem. Twitter accidentally introduced a change – the user/status call is returning an additional user object with just the ID, inside the status tag which is inside the main user object. This is why the deserialization error made no sense to me – it’s this tiny overlooked user object it can’t deserialise, not the main one.
16. So Twitter have acknowledged it as a bug, we’re not just going mad, which is nice.
17. But doesn’t do us any immediate good.
18. Jason the Tweetsharp guy checks in a workaround to their repo, just two hours after we first reported the problem. Bloody hell, go Tweetsharp.
19. But by this point Twitter are saying the problem is fixed and we just need to wait for their cache to clear.
20. And [specific problem excised for reasons of extreme dullness] means I can’t build the trunk version of Tweetsharp anyway.
21. Did I mention that we’re now around 11pm, my time (BST)?
22. I tinker a bit more then go to bed in the fond belief that Twitter will have made everything good by morning.
23. Overnight, various US timezone types keep being unable to auth and indicating sorrow.
25. Various people are saying the original issue is not fixed yet on the Twitter thread. Twitter says, Real Soon Now.
26. I add a quick and ugly patch (a regex that takes the extraneous user object out of the returned JSON).
27. Which works.
28. But a few people are still complaining they can’t tweet content.
29. Twitter assures everyone on devtalk that the original issue really is fixed.
30. There are still problems with tweeting content. This is a bigger deal than it might immediately seem: (i) it’s part of the implicit contract with players that we give actions for content tweeting, so if we’re at risk of having players tweet content and not rewarding them with actions, we take that seriously (that doesn’t seem to be happening); and (ii) the lack of tweets is already having a visible effect on our growth numbers.
31. Guess what? I can’t reproduce the tweet problem.
32. I follow what look like relevant logged errors down into the Tweetsharp source code, can’t understand the error I’m getting, can’t reproduce locally or live.
33. Maybe it’s my dodgy regex hack fix? I don’t see how but it seemed to start about the same time, and I didn’t like the hack…
34. Check devtalk. Original issue really is really fixed, Twitter says.
35. I remove the regex hack from the live site and wait to see if people are still reporting problems.
36. I find out very quickly that the original issue is not in fact fixed.
37. I find out after a delay that the tweeting issue is not fixed by removing the hack.
39. Put the regex hack back.
40. Go off and work on the migration to the new servers for a bit, hoping I’ll have an idea about the tweeting issue by the time I come back. One of the (brand new) servers suddenly becomes unresponsive. Have a long argument with my hosting provider.
41. Take a few hours off. It’s Saturday by now, dammit.
42. Add some more logging code to the site that actually tells me what the content was that someone failed to tweet. Leave it overnight. Ask players reporting bugs for more specific info (the key question being: which bit of content did you fail with?).
43. Thank Christ, there do seem to be specific bits of content that cause the functionality to break. Further investigation shows anything with a carriage return or a tab character left in (they were pasted in from various text editing tools) is breaking. But they’ve worked for months!
44. Hypothesis: the newest version of Tweetsharp, which involved some underlying rewrites, now breaks on these whitespace characters (not a problem for most people, because who’d put a tab in a tweet?).
45. Run a db script that strips out all the tabs and carriage returns.
46. Problem solved.
47. Have a double espresso.
48. Change daughter’s nappy.
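For the curious, here is a minimal sketch of the kind of quick and ugly patch described in step 26. It is written in Python rather than C#, and it is not the code that was deployed: the field names (`status`, `user`, `id`), the payload shape and the function name `strip_stub_user` are all assumptions based on the description in step 14.

```python
import json
import re

# Assumed shape of the stray object: a "user" key whose value holds only an id.
STUB = r'"user"\s*:\s*\{\s*"id"\s*:\s*\d+\s*\}'

# Remove the stub together with one neighbouring comma, so the JSON stays valid
# whether the stub is the first, a middle, or the last key of its parent.
STUB_WITH_COMMA = re.compile(r',\s*' + STUB + r'|' + STUB + r'\s*,?')


def strip_stub_user(raw_json: str) -> str:
    """Return raw_json with any id-only "user" object removed (hypothetical helper)."""
    return STUB_WITH_COMMA.sub("", raw_json)


# Hypothetical payload: a main user object, a status inside it, and the
# unexpected id-only user object inside that status.
raw = ('{"id": 42, "screen_name": "example", '
       '"status": {"id": 1001, "text": "hello", "user": {"id": 42}}}')
cleaned = json.loads(strip_stub_user(raw))
```

This is ugly for the usual reason: a regex knows nothing about JSON, so a tweet whose text happened to contain the same pattern would be mangled too. Parsing the response and deleting the key would be safer, but was presumably not an option when the failure was inside the library's own deserialiser.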
ONE! Intermittent bugs are a bitch. OK, I knew that already. But I should have enlisted the aid of my lovely players in getting more details sooner.
TWO! Deploying newer versions of key libraries is dancing for rain at the best of times, and sometimes it means you get struck by lightning. Even if it seems like a good idea at half eight in the evening.
THREE! Add logging code sooner, not later.
FOUR! Open source projects with active co-ordinators are good.
FIVE! Which I knew – detailed bug reports are very, very useful. If you’ve sent one (especially if you included exact time with timezone of problem, browser in use, current user name) consider yourself the recipient of a big beaming smile.
As for Tweetsharp, I would thoroughly recommend it for all you people out there in C# land.
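A postscript on the whitespace fix and the logging lesson, as a rough Python sketch rather than the actual db script or site code. The function names are hypothetical. Two things are assumptions on my part: line feeds are treated the same as tabs and carriage returns, and each run of them is replaced with a single space instead of being deleted outright, so that words on either side do not get glued together.

```python
import re

# Tabs, carriage returns and (by assumption) line feeds pasted in from editors.
_PASTED_WHITESPACE = re.compile(r"[\t\r\n]+")


def sanitise_content(text: str) -> str:
    """Collapse each run of tab/CR/LF characters into one space and trim the ends."""
    return _PASTED_WHITESPACE.sub(" ", text).strip()


def describe_for_log(text: str) -> str:
    """Render content for a log line so invisible characters show up as escapes."""
    return repr(text)


sample = "A dark\tnight\r\nin Tweetland"
clean = sanitise_content(sample)
logged = describe_for_log("a\tb")
```

Logging the escaped form is the part that would have saved time: a tab in a log line looks like a space, while `\t` is hard to miss.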