[AMA] I'm a Statistician/Data Scientist, Ask Me Anything

Hi all,

I'm a statistician/data scientist (apparently the sexiest profession of the 21st century) currently working for a university in grains research. Previously I've worked as a statistician in food safety research, and as a software developer.

Been around ozbargain a while, but never done one of these.

Will do my best to answer so ask away!

P.S. Yes, I know 73.2% of statistics are made up. That's not my job, that's called a politician.

closed Comments

    • +3

      Interesting question! Mind if I ask what the charity is? PM me if you want.

      As far as I know, there aren't any statistician charities. There is a professional body (Statistical Society of Australia), and you may be able to find someone through that.

      • The charity is Lipoedema Australia.

        • +3

          i would be willing to help. data analyst here. degree in math and stats.

          • @[Deactivated]: That would be amazing. I have no idea how to message you directly? Can you message me?

        • Thanks. I assume you're meaning someone doing the analysis pro bono.

          Do you have data already, or would you need to go and run a/some studies? If the former, you may find someone willing to help out, but if the latter I would expect you'd have a much tougher time. It also depends how robust you want your analysis to be.

          If you haven't done anything yet, I would strongly recommend you contact a professional statistician (via SSA), and get help designing a study from the start. One of the most annoying things statisticians encounter is people turning up with data, and asking for help analysing it without getting help to design the study first, and then the statistician tells them it's not worth the paper it's written on. It's not because we're sadistic, anal or want to be hard to get on with, it's because there's a whole bunch of things to consider before you go and collect the data otherwise you run all sorts of risks with the wider application of the results of your study. Note that I'm not in any way implying that you would do this or have done this, but just a heads up so you can understand where we're coming from.

          Does that help?

          • @moar bargains: We don't have data currently, although I'm expecting it'll already exist e.g. number of knee reconstructions and the average cost etc.
            I'll try and loop in with darkzen15 above. If that doesn't work out I'll try the link above that you've provided.
            Thanks for the advice.

            • @Aureliia: Sorry, it sounds like I misunderstood the level you wanted to go to (assumed you were thinking a full clinical trial or analysis of one). Feel free to send me a message and we can discuss a bit more. If you click my username, there should be an envelope at the bottom that says "Start a conversation".

            • @Aureliia: Try MIP - The Data Company. I've been to one of their Sydney Alteryx User Groups and they do free analytics and insights consulting and even ongoing support and implementation for NFPs (as far as I know).

              https://mip.com.au/

    • +5

      There is a meetup group, Data for Social Good. It's worth talking to them or arranging a session with the group.

    • +1

      These guys do some work on data for charities: https://www.linkedin.com/company/civita-tech/

      • +1

        That's pretty cool! Thanks for pointing them out!

  • +1

    What is your favourite pizza topping?

    • +1

      Hawaiian. Don't mind meat lovers too. And vanilla for a milkshake.

      I'm a boring guy. I am a statistician after all…

      • -5

        Hawaiian

        I'm sorry to hear that

        • Well, apparently there's another 6 Hawaiian pizza lovers out there!

  • +2

    All you need to know about statistics is that correlation does not imply causation

  • How much should we the general human race fear GMO crops (either for human or fodder consumption)?

    How often do you hear/know of Big Bad Business paying off/manipulating results to their benefit?

    Do you believe in overpopulation?

    • +1

      Interesting questions.

      How much should we the general human race fear GMO crops (either for human or fodder consumption)?

      I was actually at a conference last week where this came up. I am not even close to an expert on this (and don't even understand most of the technicalities myself), but I think it's like AI or many other new technologies. I think fear is misplaced, but I think there should be a healthy amount of caution, oversight and regulation. Last week South Australia relaxed restrictions around GMOs and in the grains industry where I work the reaction was overwhelmingly positive. It provides an enormous opportunity, because it will allow farmers to grow grain with higher yield and with less chemical usage which I think is a good thing. But I don't think it was purely from a "Great now I can make more money for less effort" point of view. My memory from the talk last week is that there are (3 I think, maybe more?) different "levels" of GM. The lowest one is actually quite wide-spread already, and the 2nd is increasing, especially in countries with less rigorous standards and controls (large Asian neighbours for example), and it's the 3rd one that causes concern. So I think we need to change the discussion from "should we" to "it's happening, how do we manage it". I dunno, still learning and thinking about this one.

      How often do you hear/know of Big Bad Business paying off/manipulating results to their benefit?

      In my role, very seldom. However, as I mentioned in another reply, misuse of statistics is a problem. For example, if a pharmaceutical company wants to show their new drug is better than the old one, it can be a matter of just taking enough samples. With enough data, it's easy to find a statistical difference and they know that. There are other regulations to get a drug to market than just one trial though, so I think the bigger issue is people who are under pressure to publish, and then do some dodgy things either knowingly or (more likely/common IMO) unknowingly. I think misuse and misunderstanding of statistics is also a problem in things like politics and news reporting, because suddenly you have things like "X doubles your risk of cancer!" or "Let's spend $Y billion taxpayer dollars on something that improves something for 5 people by 20% and 500 people by -0.1%". They are obviously cynical/facetious examples, but those sorts of examples are not hard to find.
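      A rough sketch of the "just take enough samples" point, assuming a made-up effect of 0.02 standard deviations and a simple two-sample z-test with a known SD (both are illustrative assumptions, not anything from a real trial). The observed difference is fixed at the assumed effect so that only the sample size changes:

```python
import math


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical effect: 0.02 standard deviations, i.e. practically negligible.
p_values = {
    n: two_sample_z_p_value(mean_diff=0.02, sd=1.0, n_per_group=n)
    for n in (100, 10_000, 1_000_000)
}

for n, p in p_values.items():
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

      The same tiny difference is nowhere near significant with 100 per group, but becomes overwhelmingly "significant" with a million per group, even though it is no more useful in practice.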

      Do you believe in overpopulation?

      In the sense of "it's possible", or "it's going to happen", or "it's already happened"? I believe it's possible. Not sure about the other two options. That one's too big for me at the moment.

  • Cool AMA. Do you find yourself wanting to educate people when they recite statistics to you that actually don't really suggest what they think the data suggests?

    Eg 95% of shark attacks on humans happen near the beach - yes, that is where the humans are.

    90% of car accidents happen within 10km of one's home - yes, 90% of people's driving occurs close to one's home.

    Any others you've come across?

    And what's your perspective on the ubiquitous "gender wage gap"?

    • Thanks! :)

      Do you find yourself wanting to educate people when they recite statistics to you that actually don't really suggest what they think the data suggests?

      Yes, so much! As I mentioned above, it's very common in the media for example. "This thing is going to double your risk of cancer". But if your risk is 1 in 1 million, and it's now 2 in 1 million, is it really worth worrying about? And also things like "increase of up to 50%", or "average improvement of 20%" or things like that. Numbers like that (usually) have very little meaning without some context of variation. Does this increase for everyone by 20% (small standard deviation/low variation), or are most people going to increase by 2% and a very few will increase by 80% (high variation)?
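      A quick sketch of the arithmetic behind both points. All numbers are made up for illustration: a baseline risk of 1 in 1 million, and two hypothetical groups of 13 people who both show a 20% average improvement:

```python
import statistics

# Relative vs absolute risk: "doubling" a 1-in-a-million risk.
baseline_risk = 1 / 1_000_000
doubled_risk = 2 * baseline_risk
relative_increase = doubled_risk / baseline_risk   # the headline number
absolute_increase = doubled_risk - baseline_risk   # the number that matters to you
extra_cases_per_million = absolute_increase * 1_000_000

print(f"relative increase: {relative_increase:.0f}x")
print(f"extra cases per million people: {extra_cases_per_million:.1f}")

# Same average improvement (20%), very different spread.
low_variation = [18, 19, 19, 20, 20, 20, 20, 20, 20, 20, 21, 21, 22]
high_variation = [2] * 10 + [80] * 3

for name, values in (("low variation", low_variation),
                     ("high variation", high_variation)):
    print(f"{name}: mean = {statistics.mean(values)}%, "
          f"sd = {statistics.stdev(values):.1f}%")
```

      Both groups report "an average improvement of 20%", but in the second one most people get 2% and a few get 80%, which only shows up once you look at the standard deviation.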

      And what's your perspective on the ubiquitous "gender wage gap"?

      Interesting question. As far as I understand, it's illegal to pay two people doing the same work different amounts, regardless of gender, and has been for some time. So the fact that the gender wage gap still "exists" must mean that either a) most of the country's employers are flouting the law and somehow all getting away with it, or b) it's due to factors other than gender alone. I think Jordan Peterson explained it well, despite the interruptions: https://www.youtube.com/watch?v=Xg2psply4no

      • +1

        I agree with "factors other than gender alone". I'm forever trying to explain to students building regression models that they "can only find the effects they look for", i.e. that the apparent narrative will change every time they include a new bunch of control variables.
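        A minimal simulated sketch of how a control variable changes the story. Everything here is an assumption for illustration: the data are generated so that pay depends only on hours worked, and one group simply works more hours on average. The adjusted coefficient is computed by partialling out hours (the Frisch-Waugh-Lovell approach), which gives the same group coefficient as fitting pay on group and hours together:

```python
import random
import statistics


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Residuals of y after regressing it on x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


random.seed(1)
n = 2000
group = [random.randint(0, 1) for _ in range(n)]
# Hypothetical: group 1 works about 8 more hours per week on average.
hours = [30 + 8 * g + random.gauss(0, 3) for g in group]
# Pay depends on hours only; group has no direct effect by construction.
pay = [25 * h + random.gauss(0, 40) for h in hours]

# Model 1: pay ~ group (no controls).
_, naive_effect = simple_ols(group, pay)

# Model 2: pay ~ group + hours, via residualising both on hours.
_, adjusted_effect = simple_ols(residuals(hours, group), residuals(hours, pay))

print(f"group effect without controls:   {naive_effect:7.1f}")
print(f"group effect controlling hours:  {adjusted_effect:7.1f}")
```

        Without the control, the model reports a large "group effect"; with hours included it is close to zero, because the simulated gap runs entirely through hours.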

        I find this particularly troublesome as this applies to models in psychology - if you enter age, gender, socioeconomic status, IQ, personality etc into a model, almost no novel predictor will uniquely account for variance in any outcome. Hence, researchers do not control for too much in their articles. It's a careful balance - control for enough to pass the "smell test" but not so much that your effect is not sig and your paper becomes unpublishable.

        TLDR most psychology research is bunk due to failure to control for relevant confounds and nuisance variables.

        • +2

          Yes, and unfortunately psychology gets a bad rap by statisticians I think because of the amount of rubbish that gets through, at the cost of missing the good stuff! There are some fascinating and really well designed psych studies, but unfortunately they seem to be the minority.

      • +1

        Yes, the one about Nicolas Cage movies being associated with drownings is a classic. However, I am not entirely convinced that the former is not causally related to the latter (sorry Nick).

        • +1

          Yes, I love them! The cheese consumption vs death by bedsheets is one of my favorites! It's a pity they haven't been updated in a while.

      • +1

        i think that's my new favourite thing

  • What are the chances of me getting laid tonight?

    • +3

      Sorry, no way for me to estimate the impossible 😉

      • +2

        Damn. You must know my wife

        • What can I say? I'm also a pro level stalker.

          (Hope your comment was a joke too because otherwise that sucks 😟)

        • we must have the same one….

  • how do you feel, when you know that the majority of us have an above average number of feet

    • 😱 Quick, get one of your feet removed!

  • how much do you get paid?

    • +1

      https://joboutlook.gov.au/Occupation?search=Career&code=2241…

      This is pretty accurate around conditions etc. I get paid a bit less than that, but I'm more than 10 years younger than the average listed there. But I do get 17% super which is awesome, especially for someone under 30!

      • wow $107K for someone under 30 not bad

        I disagree with the super tho, you still have almost 4 decades of waiting until you can access it.

        • No, no I'm a fair bit under $107k, although that's not including super.

          17% though, not sure I'll need to wait 4 decades to retire comfortably. Although I will need to wait to access it I guess.

  • I'm currently doing a statistical analysis of 700k leaked logins in the format of email:password

    So far I got password length distribution, popularity of putting a year of birth in email, next is evaluating the complexity of passwords.

    Can you think of any other cool stuff I could extract out of it?

    • +2

      Try popularity of passwords cursing because the system wouldn't accept any other elaborate passwords.

    • You could look at popularity of email providers?
      I guess the most interesting thing here is the standard 'popularity/reuse' of passwords. Can't think of anything else to suggest at the moment, but will let you know if I do.

      • +1

        Thanks,
        This is a list of the queries i already ran:
        - percentage of .gov.au emails, i'll drill down later per state / department
        - emails with 2 and 4 digit year in email ([email protected]) - I'll try to correlate the year with password choices later
        - percentage of users with same password as username
        - count of users ordered by length of password (9 is the most popular)
        - count by email domain (yahoo biggest by far)
        - count by email country code (.com .ru .net .uk are top leaders)
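        A small sketch of how queries like these could be computed in Python. The records below are invented stand-ins in the same email:password format (not real data), and names such as `parse_line` are hypothetical:

```python
import re
from collections import Counter

# Made-up records in email:password format, for illustration only.
sample_lines = [
    "alice1985@example.com:sunshine1",
    "bob@example.org:bob",
    "carol77@example.com:Tr0ub4dor&3",
    "dave@agency.gov.au:password1",
    "erin2001@example.net:erin2001x",
]


def parse_line(line):
    """Split one record into (username, domain, password)."""
    # Split on the first colon only, since passwords may contain colons.
    email, _, password = line.partition(":")
    username, _, domain = email.partition("@")
    return username, domain.lower(), password


records = [parse_line(line) for line in sample_lines]

# Count of users by password length.
length_counts = Counter(len(password) for _, _, password in records)
# Count by email domain.
domain_counts = Counter(domain for _, domain, _ in records)
# Users whose password equals their username.
same_as_username = sum(password == user for user, _, password in records)
# Share of .gov.au addresses.
gov_au_share = sum(d.endswith(".gov.au") for _, d, _ in records) / len(records)
# Usernames containing a standalone 4-digit year (19xx or 20xx).
year_pattern = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
four_digit_year = sum(bool(year_pattern.search(user)) for user, _, _ in records)

print("most common lengths:", length_counts.most_common(3))
print("most common domains:", domain_counts.most_common(3))
print("password same as username:", same_as_username)
print(f".gov.au share: {gov_au_share:.0%}")
print("usernames with a 4-digit year:", four_digit_year)
```

        Two-digit years are deliberately left out here, because any two digits in a username would match and the count would need more careful rules.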

        • Nice work! 9 characters as length is interesting. Is that the minimum? Do they all have uppercase and/or special characters? Do you know where the data came from? Are there any patterns around the source of the data in either emails or passwords? E.g. [email protected]:password_for_site_x

  • Doing a masters of financial mathematics with a project component at UNSW at the moment. Do you think I have limited my prospects too much by doing this instead of master of stats? I have done a lot of stochastic analysis, time series, Bayesian inference and monte carlo methods in this course so there is some overlap but I'm worried i haven't done enough statistics to get into a data scientist/statistician role. I have also done a major in statistics in my undergraduate course if that helps.

    I also didn't take any programming units but do have basic understanding of it and will be using it for my project. Do you think i will need to take a short programming course to better improve my qualifications or will this be enough? I also haven't decided on what language I will use for my project but I'm deciding between C++ and python. Recommendations?

    • TBH it sounds like you'll be fine, but it really depends what the area is that you want to work in. You may not get many interviews for a clinical trial analysis role, but for a data scientist (especially in finance areas), you should be OK.

      Programming will be your biggest hurdle though - so much of stats and data science these days is programming. I'd strongly recommend you spend some time learning this. The more you can do and understand (around programming and computer science), the better off you'll be, but I completely understand there's lots to learn and limited time!

      I would also strongly recommend you use Python unless there's a very, very good reason to use C++. It's a much friendlier language, especially if you're new-ish to programming, and the only advantage C++ really has is speed. I can only think of two reasons why you would want to use C++ and they are a) your supervisor told you to, only knows how to use C++ and can help you, or b) you care about nanoseconds of run time.

      • +1

        awesome thanks for the response.

        I use VBA and SQL a lot at work but python and C++ are on a whole new level. I don't think I will have trouble picking it up; the only concern is that without any official programming courses under my belt, I'm not sure if I can show that I'm good enough at programming.

        Hopefully the project will be sufficient.

        • +1

          You're most welcome. What's the masters project on?

          Going from VBA and SQL to python is not a huge jump, I reckon you'll be ok :) Programming is something that is very much demonstrable, so you can prove you have the skills by showing you have the skills, and a masters project would be perfect. You could also supplement this with personal side projects depending on your time/interest. If you plan on doing some serious programming down the track, it would be worth doing an "Intro to computer science" sort of course, as you will learn some good practices that will serve you well.

          • @moar bargains: Still deciding on the project but currently gravitating towards the application of neural networks and deep learning in pricing a financial derivative.

            • @raz11: Nice! That sounds cool. Definitely go with Python for that.

  • What are some of your favorite tools of the trade, or are you agnostic?

    Excel
    Minitab
    R
    Python
    SAS / SQL

    • I love R, and quite like python. Have never used Minitab. Never used SAS either, but about to start a project comparing it to some free alternatives.

      I use Excel a bit as well, but not for stats, just data management stuff.

      I'm pretty agnostic when it comes to tools, just depends what the requirement is. At the moment R is basically the only thing capable of doing the work I/my group does.

  • I'm in marketing. What do you think the single best upskill would be for me in the way of data science? It's something I'm very interested in, although not very good at (I passed statistics at uni but it hurt my brain).

    At the moment I'm thinking about joining some data science and data visualisation clubs and just immersing myself in it but if you have a better vector into jumping into the world of statistics I'd love to hear.

    • I think the best thing you could do would be find some online courses around stats and see how you find them. I think one of the biggest problems with stats is that it's almost universally terribly taught :( which means many people think that "It's me - I'm not good at stats", rather than realising it's like learning a musical instrument, and it takes some practice to get comfortable.

      Pair that with some practical exercises (otherwise you'll go crazy), and see if you can get started with doing some interesting things. Kaggle is a good resource for just having a go at some interesting data problems. Finding something you're interested in can really help too, and there are a ton of data sets on Kaggle as well as other areas.

      Does that help?

  • What's your income and hours?

  • Thank you for the AMA and great career choice! I have a couple of questions:
    1. Why have you chosen grains research?
    2. How challenging is it to obtain research funding in your field?
    3. What's your opinion on monoculture?

    • Thanks :)

      1. Sort of accidental really. First job out of uni was in food safety research, which was a good job but short term contract. I met my current boss during that time. A few years later when I was working in a different role (software developer in a government department) my current job came up. I've never really had a specific focus or goal of what I wanted to do (hence the stats degree because it's so broadly applicable), but enjoy my current job.
      2. Fortunately I don't have to deal much with this, as I'm just a member of a group who are all funded under one grant. I'm pretty junior, so don't have to deal with this stuff.
      3. Not really sure what you mean by this, but assume you mean planting the same crop year after year in the same area? Most of the growers I encounter don't do this, so I guess it's not ideal practice? I'm developing opinions about a lot of things I have never thought about until the last couple of days ;)
  • +1

    What's your favorite xkcd?

    mine

  • What's the solution to the problem posed by discrete quantities of data?

    • 42.

      • Isn't 42 a quantity?

        • I don't understand what you mean in your original question

          • @moar bargains: Is not mathematics the study of quantities? We study quantities because they are problems are they not? And as a data scientist you digitise these problematic quantities so that we can then view them in a nice chart or graph to track their progress or change.

            But what I want to know is whether or not there is a solution to the problem of countable quantities.

            I've seen the charts on the progress of carbon emissions over the decades but what is the solution to countable quantities of carbon emissions?

            When you graph the carbon emissions data what are you measuring the damaging quantity progressing relative to what utopian value that would solve the problem posed by a quantity?

  • What is the solution to the electricity, gas and water puzzle?

  • What are your thoughts/experiences regarding actuaries? Do you think they have a skillset to work in data science?

    Thanks!

    • I think they've got as much chance (maybe more) as other numerical professions. As I mentioned on one of the other comments the biggest problem will probably be the lack of programming skills/experience in some of the frameworks.

  • As a Statistician/Data Scientist, are you good with Financial modelling??

    • Nooo… I can do my budget and taxes at home, but I wouldn't want to touch anything like modelling stock market data without spending a fair bit of time learning.

  • What is the standard deviation of time spent browsing Ozbargain?

    • 42.63 minutes.

      • +1

        Confidence level 85% thank you

  • What is your recommendation for freshly graduated?

    • +2

      There seem to be plenty of jobs around for people with these sorts of skills, but (as usual) the issue is many of them want experienced people. In terms of how to go about this, that's not really specific to this field.

      One suggestion if you have a goal in mind (e.g. getting a job as a statistician/data scientist). If you have the time and ability, join the local branch of the professional stats groups (stats society, biometrics society, institute of analytics professionals, etc or pick the one most relevant). Then go to meetings. Get involved, and get your name and face out there, so that when jobs come up with these people, they will know your name, and so you already have an advantage.

      Good luck!

  • Which university did you graduate from, and if you could choose again, would you still choose statistics?

    • I graduated from the University of Adelaide, Bachelor of Mathematical and Computer Sciences with a major in statistics. Completed Honours in statistics while working at my current job.

      If I could go back I might do some things a little differently, but overall I'm happy with where I am at the moment.

  • Can economists really predict the past 50% of the time?

    • +1

      I've heard it said that economists were invented to make weather forecasters look good.

      That might be a little unfair. Economists have after all predicted 9 of the last 5 recessions.

  • Have you looked at Stats & Records - OzBargain Wiki. What statistics would be interesting to data visualise or use sparklines? Any comments to make it better?

    A related question: any comments (preferably there) on Data Visualization of Badges Growth over Time - OzBargain Forums post? Another pseudonymous data scientist said there are better indicators. What would those be?

    • Have you looked at Stats & Records - OzBargain Wiki. What statistics would be interesting to data visualise or use sparklines? Any comments to make it better?

      I've had a browse from time to time. I can't think of much to improve it at the moment. I imagine the main things of interest on a website like this are things like number of visitors, length of visit, do they click links/ads, interactions on the forums, things like that. I'm sure Scotty and team are well across all that stuff.

      A related question: any comments (preferably there) on Data Visualization of Badges Growth over Time - OzBargain Forums post?

      That would be mildly interesting, but I think what Datascientist1 was getting at is that the other metrics like number of visitors and logins would answer a similar question more easily. Have a go yourself though if you want and see what you can find!

  • +1

    What software do you recommend to make presentation-ready graphs? Honestly, MiniTab and MATLAB graphs are the worst aesthetically.

    • +1

      I use ggplot2 or plot.ly usually. Depends a little on how you're planning to present (powerpoint, web/blog, technical paper, journal submission etc)

  • +1

    Do you build tools for your individual use/daily life? Does it help you in your ozbargain/deals hunt?

    • +1

      Great question! Yes I do, because I'm a nerd ;)

      I built a webapp where I could monitor my historic solar production. I also analyse my bank transactions occasionally to update my home budget. There's a couple of others I've done too, but nothing really to help with bargain hunts other than a couple of IFTTT recipes. I have a bunch on my to-do list as well, just gotta find a time to get around to them :)

      • That's sweet! Any particular API that you use a lot?

        I'm learning python/pandas and trying to do some projects outside of work (not really relevant to my current role). Just that I can't find much practical use of it

        • +1

          I personally use R and Shiny. I have also been using googlesheets API and Mongo DB lately to host some data, so that's been interesting to play around with.

          Have you had a look on Kaggle? Lots of data there. Another one is data.gov.au. You could look at something like the Ambulance Victoria data for example. Were there specific suburbs that had high wait times, and large numbers of calls? What is the average proximity to an ambulance station? Things like that. I would like to do more of these sort of side projects, but half the time my issue is coming up with questions that are interesting that I can answer (non-trivially) with the data.

          • @moar bargains: Hmm Googlesheets API would be interesting, I only know google apps script, and you need javascript to do that

            I have done some basic level of the titanic kaggle. Learned a lot by doing it, but unfortunately it isn't practical and I don't know what to do with it after. Hence the first question :)

            Out of curiosity, did you know/plan to be a data scientist when you chose your uni degree?

            Thanks a lot for answering and doing this AMA!

            • @Ceri: You're most welcome! Thanks for your questions :)

              When I finished high school I didn't know what I was going to do, so took a gap year. In that time I remembered that I did ok at stats in HS and it kinda made sense to me. So I basically did enter uni intending to be a statistician (I didn't hear the term data scientist until after I graduated). I'm a little bit strange :)

              The best thing you can do is to find a problem in your daily life and attack it. What are your interests? Do you like going to the gym and wear a fitbit? Write an app to import and analyse that data. Do you like reading? See if you can write an app that will interface with a google maps API and plot all the cafes/restaurants you've eaten at. Things like that :) I had an old solar system that didn't have a web api, so I went and recorded the solar production every day into a google sheet on my phone. Then I wrote an app to pull that sheet in and produce a plot.
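              As a rough idea of how small such a project can start, here is a sketch of the solar example in Python (the original was built with R and Shiny, so this is a stand-in, not the actual app). It assumes the sheet has been exported as CSV with hypothetical `date` and `kwh` columns, uses made-up readings, and prints a text bar chart instead of a proper plot:

```python
import csv
import io
from collections import defaultdict

# Made-up daily readings standing in for a CSV export of the sheet.
sheet_csv = """date,kwh
2019-01-30,21.4
2019-01-31,19.8
2019-02-01,22.1
2019-02-02,8.3
2019-02-03,20.6
"""

# Sum production per calendar month (dates are assumed to be YYYY-MM-DD).
monthly_kwh = defaultdict(float)
for row in csv.DictReader(io.StringIO(sheet_csv)):
    month = row["date"][:7]
    monthly_kwh[month] += float(row["kwh"])

# One '#' per 5 kWh gives a crude plot in the terminal.
for month, total in sorted(monthly_kwh.items()):
    bar = "#" * round(total / 5)
    print(f"{month} {total:6.1f} kWh {bar}")
```

              Swapping the inline string for a real export, and the text bars for a plotting library, would be the natural next steps.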

              Hope that helps!
