Telemetry (WAS: Attempt at a real world benchmark)

Sat Dec 10 08:10:00 UTC 2016

Hi,

I’m mostly against any tracking.  For privacy reasons, but also: what is the data going to tell us?
Would we track timings, used extensions and ghc version, or module size? And would we do so
per compiled module, per compiled project or per ghc invocation?

What are the reasons we believe the packages in hackage, or the more restrictive stackage set,
are non-representative?  If we can agree that they are representative of the language and its
uses, analyzing the publicly available code should provide almost identical results to
large-scale compiler telemetry, no?

I have no idea how pervasive telemetry is on Windows. Nor do I know how much macOS, or the
applications that are shipped with it by default, actually phone home.

Two items I would like to note that *do* phone home and are *opt-out*:

- the Homebrew[1] package manager, which I assume quite a few people use (because it works
  rather well); see the Analytics.md[2], especially the opt-out section[3].
- CocoaPods[4] (iOS/macOS library repository), which sends back statistics about package
  usage[5]

In both cases, I would say the community didn’t really appreciate the change but was unable
to change the direction in which the maintainers/authors were taking the tool.

I think we first need a consensus on what questions we would like to answer, and then
figure out which of these questions can only be answered properly by calling home from the
compiler.

I am still opposed to the idea of having a compiler call home, and would try to make sure that
my compiler does not, so that I would not accidentally risk sending potentially sensitive data.
Most likely that means only using custom-built compilers that have this functionality surgically
removed, which would end up being a continuous burden to keep up with. On the whole, it would
undermine my trust in the compiler.

cheers,
 moritz
-- 
[1]: http://brew.sh/
[2]: https://github.com/Homebrew/brew/blob/master/docs/Analytics.md
[3]: https://github.com/Homebrew/brew/blob/master/docs/Analytics.md#opting-out
[4]: https://cocoapods.org/
[5]: http://blog.cocoapods.org/Stats/

> On Dec 10, 2016, at 1:34 PM, Manuel M T Chakravarty <chak at justtesting.org> wrote:
> 
>> Simon Peyton Jones via ghc-devs <ghc-devs at haskell.org>:
>> 
>> Just to say:
>>  
>> ·         Telemetry is a good topic
>> ·         It is clearly a delicate one as we’ve already seen from two widely differing reactions.  That’s why I have never seriously contemplated doing anything about it.
>> ·         I’d love a consensus to emerge on this, but I don’t have the bandwidth to drive it.
>>  
>> Incidentally, when I said “telemetry is common” I meant that almost every piece of software I run on my PC these days automatically checks for updates.  It no longer even asks me if I want to do that.. it just does it.  That’s telemetry right there: the supplier knows how many people are running each version of their software.
> 
> I think it is important to notice that the expectations of users vary quite significantly from platform to platform. For example, macOS users on average expect more privacy protections than Windows users and Linux users expect more than macOS users. In particular, a lot of 3rd party software on macOS still asks whether you want to enable automatic update checks.
> 
> Moreover, while most people tolerate that end user GUI software performs some analytics, I am sure that most users of command line tools (and especially developer tools) would be very surprised to learn that they perform analytics.
> 
> Finally, once you gather analytics you need to have a privacy policy in many/most jurisdictions (certainly in EU and AU) these days, which explains what data is gathered, where it is stored, etc. This typically also involves statements about sharing that data. All quite easily covered by a software business, but hard to do in an open source project unless you limit access to the data to a few people. (Even if you ask users for permission to gather data, I am quite sure, you still need a privacy policy.)
> 
> Manuel
> 
> 
>> From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On Behalf Of MarLinn via ghc-devs
>> Sent: 09 December 2016 14:52
>> To: ghc-devs at haskell.org
>> Subject: Re: Telemetry (WAS: Attempt at a real world benchmark)
>>  
>> 
>> 
>> It could tell us which language features are most used. 
>> 
>> Language features are hard if they are not available in separate libs. If they are in libs, then IIRC Debian is packaging those in separate packages, so again you can use their popularity contest.
>> 
>> What in particular makes them hard? Sorry if this seems like a stupid question to you, I'm just not that knowledgeable yet. One reason I can think of would be that we would want attribution, i.e. did the developer turn on the extension himself, or is it just used in a lib or template – but that should be easy to solve with a source hash, right? That source hash itself might need a bit of thought though. Maybe it should not be a hash of a source file, but of the parse tree.
>> 
>> 
>> The big issue is (a) design and implementation effort, and (b) dealing with the privacy issues.  I think (b) used to be a big deal, but nowadays people mostly assume that their software is doing telemetry, so it feels more plausible.  But someone would need to work out whether it had to be opt-in or opt-out, and how to actually make it work in practice.
>> 
>> Privacy here is a complete can of worms (keep in mind you are dealing with a lot of different legal systems); I strongly suggest not even thinking about it for a second. Your note "but nowadays people mostly assume that their software is doing telemetry" may perhaps be true in the sick mobile apps world, but I guess it is not true in the world of developing secure and security-related applications for either server usage or embedded.
>>  
>> My first reaction to "nowadays people mostly assume that their software is doing telemetry" was to amend it with "* in the USA" in my mind. But yes, mobile is another place. Nowadays I do assume most software uses some sort of phone-home feature, but that's because it's on my To Do list of things to search for on first configuration. Note that I am using "phone home" instead of "telemetry" because some companies hide it in "check for updates" or mix it with some useless "account" stuff. Finding out where it's hidden and how much information they give about the details tells a lot about the developers, as does opt-in vs opt-out. Therefore it can be a reason to not choose a piece of software or even an ecosystem after a first try. (Let's say an operating system almost forces me to create an online account on installation. That not only tells me I might not want to use that operating system, it also sends a marketing message that the whole ecosystem is potentially toxic to my privacy because they live in a bubble where that appears to be acceptable.) So I do have that aversion even in non-security-related contexts.
>> 
>> I would say people are aware that telemetry exists, and developers in particular. I would also say developers are aware of the potential benefits, so they might be open to it. But what they care and worry about is what is reported and how they can control it. Software being Open Source is a huge factor in that, because they know that, at least in theory, they could vet the source. But the reaction might still be very mixed – see Mozilla Firefox.
>> 
>> My suggestion would be a solution that gives the developer the feeling of making the choices, and puts them in control. It should also be compatible with configuration management so that it can be integrated into company policies as easily as possible. Therefore my suggestions would be
>> 
>> ·      Opt-In. Nothing takes away the feeling of being in control more than perceived "hijacking" of a device with "spy ware". This also helps circumvent legal problems because the users or their employers now have the responsibility.
>> 
>> ·      The switches to turn it on or off should be in a configuration file. There should be several staged configuration files: one for a project, one for a user, one system-wide. This is for compatibility with configuration management. Configurations higher up the hierarchy override ones lower in the hierarchy, but they can't force telemetry to be on – at least not the sensitive kind.
>> 
>> ·      There should be several levels or a set of options that can be switched on or off individually, for fine-grained control. All should be very well documented. Once integrated and documented, they can never change without also changing the configuration flag that switches them on.
>> 
>> There still might be some backlash, but a careful approach like this could soothe the minds.
>> 
>> If you are worried that we might get too little data this way, here's another thought, leading back to performance data: The most benefit in that regard would come from projects that are built regularly, on different architectures, with sources that can be inspected and with an easy way to get diffs. In other words, projects that live on github and travis anyway. Their maintainers should be easy to convince to set that little switch to "on".
>> 
>>  
>> 
>> Regards,
>> MarLinn
>> 
>> _______________________________________________
>> ghc-devs mailing list
>> ghc-devs at haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
> 