[AMA] I'm a Statistician/Data Scientist, Ask Me Anything

Hi all,

I'm a statistician/data scientist (apparently the sexiest profession of the 21st century) currently working for a university in grains research. Previously I've worked as a statistician in food safety research, and as a software developer.

Been around ozbargain a while, but never done one of these.

Will do my best to answer so ask away!

P.S. Yes, I know 73.2% of statistics are made up. That's not my job, that's called a politician.

closed Comments

    • +1

      Speaking of deals, what about scanning the food giftcards (e.g. GoodFood, Gourmet Traveller), Entertainment Book, TheFork, EatClub, Liven, First Table, etc with cashback deals and discounts (e.g. GoodFood currently has up to 20% discount) and then sort them to a particular set of criteria based on your (future) location. Even better, scrape the menus so you can filter on what you feel like eating for dinner that day. Then add ratings from Zomato, TheFork, Google, Facebook…

      • Yeah, great ideas! I've got a few myself, but just need the time to do them!

        • What's your proportion of engineery vs data sciencey tasks? For the above task, for example, I would imagine the bulk of the work would be on the gathering of data, connecting to APIs, making the code robust so that you would be able to add a new service easily. Should I be focusing more on the engineering or learn more on statistics, or does it depend?

          • @Mactuary: I'd say it very much depends on your end goal. If you're wanting to build an app that millions of people use, then probably engineering is a better focus. If you are doing it for personal interest or to land a different/better job then do whatever you enjoy most! :)

  • This is about feature engineering. Seems like the articles I read contain amazingly creative ways of extracting something very non-obvious from the data. Do you have a particular method for doing this that has worked well?

    I'm working on a project that has student ratings and comments (free text) as well as other structured variables such as meeting duration, student and teacher location, device used (computer/mobile) for the meeting, etc. I haven't had my "light-bulb" moment yet. I can only see using the free text comments to extract some more data, e.g. sentiment, topics, length of text, use of emojis. Other than that, at this rate I'm just going to clean it up, put it through several ML algorithms and see what works well.
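
    For what it's worth, the simpler text features mentioned above (length of text, use of emojis) could be pulled out with something like the rough sketch below. The function name `comment_features`, the feature names and the emoji ranges are all made up for illustration, and the emoji pattern only covers the main Unicode emoji blocks. Sentiment and topics would need a proper NLP library and aren't attempted here.

```python
import re

# Rough emoji matcher (assumption: only the main emoji/symbol blocks,
# not the full Unicode emoji specification).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def comment_features(comment: str) -> dict:
    """Turn one free-text comment into a few simple numeric features."""
    words = re.findall(r"[A-Za-z']+", comment)
    return {
        "n_chars": len(comment),
        "n_words": len(words),
        "n_emojis": len(EMOJI_PATTERN.findall(comment)),
        "n_exclamations": comment.count("!"),
        # Guard against empty comments to avoid dividing by zero.
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


# Hypothetical comment, purely for illustration.
example = comment_features("Great session, thanks! 😀")
print(example)
```

    Each comment becomes one row of numbers that can sit alongside the structured variables (duration, device, etc.) before going into a model.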

    Any suggestions?

    • No, unfortunately I'm not really one of those very creative types :( And to be fair, most people aren't, even among data scientists and statisticians. It's the same as any other industry - you don't hear about the average web developers, just the ones that make millions from making a social network.

      Hm… yeah you could have a look at sentiment analysis. I guess the key thing is what question are you trying to answer with this data? Estimate student engagement with a course (or courses)? Some other ideas (without full understanding of the data): Does it change with device type used? Can you identify students (anonymously I would assume)? Can you track/match/correlate their answers across time or courses? Does teacher make a difference to course engagement? If it's a big dataset with enough free text: can you match students based on their writing style? Could you classify them somehow using this?

      Hope some of that helps :)

      • Thanks for the answers.

        Those have crossed my mind and I'm definitely going to pursue it. Writing style is an interesting one. Though maybe the texts aren't long enough for that.

        Students are identifiable (anonymously as you assumed correctly), so I can track them across different sessions/time. Maybe I can get sufficient amount of text per student!

        • Yeah having encountered a few student feedback forms, text length can be a problem! Perhaps you could try and rank their literacy level by collating their answers? Will let you know if I have any light-bulbs too.

  • Speaking of Kaggle, what do you think about Kaggle? I've heard a point of view that Kaggle is not really anything like real life, because some of the hardest tasks in data science have already been served on a plate in Kaggle. They were specifically talking about
    1. getting the data
    2. choosing an appropriate metric

    Have you done a Kaggle competition? Is doing a half-decent job in a Kaggle competition regarded positively by recruiters?

    • +1

      I've never done a competition, but have had a look at a couple of datasets on there. I can see how that point of view could be taken. I think it's a worthwhile training/learning tool, but I think potential employers would only take note if you've done something new/novel/noteworthy. If you've just copied code that's there and got it to run, then it's no better than an assignment from Uni (possibly even worse).

      So what you really want to do is use the data to learn some skills, then take those skills and apply them to another problem someone else hasn't done already (or an area someone hasn't looked at).

  • I've done stats a lot at uni and it was really boring and forgettable. Do you know how to make stats fun? I don't need a lot of stats, just a little bit to get by. Is there a single stats pop science book that will do it?

    • The problem with stats is that it's so broad. It's like medicine, in that there are a million specialities/fields you can get into. What field are you in?
      I will have a think and talk to some colleagues tomorrow, because I remember one mentioning a youtube channel I think? Please remind me if I forget to get back to you :)

  • One last question from me for tonight. Do you use any SRS (spaced repetition software) like Anki, SuperMemo, etc to help you remember things like R commands or statistical formulas? I still can't remember how to do a simple regex or plot histograms off the top of my head even though I've done it so many times before (thanks Google). Granted these are not my bread and butter at my current work. Do you have a specific method to memorise these things?

    • Not really to be honest. I just remember the stuff I need to use regularly, and the rest I Google. I can remember a lot, but many of the obscure functions and arguments I still Google.

      This was one of the things that irked me the most about my degree. Why did I have to memorise all these equations and proofs and reproduce them under exam conditions? In my job I just Google them, or look in a textbook if it's really obscure.

      Thanks for your questions :) hope my answers have been helpful, and feel free to ask more if you have them.

  • How do I turn all my if/else statements into 'Machine Learning' and 'AI'?

    • Magic :)

      No idea to be honest! It may require a different approach i.e. starting with your end categories and creating datasets from there that you use to train a model.

  • Are you normal?

    • Certainly not :) I think I was dropped as a baby, that's why I got into stats. But they've basically fixed the twitching. I usually don't even twitch now unless I get asked a really dumb question. Ah, the wonders of modern medicine! :)

      • Where are you on the normal distribution? And what is 'normalisation' as I'm pretty sure it's a thing?

        One day at work a colleague referred to the tail on a pig when talking about something unlikely. But he was talking about a rare event, so I corrected him and said it was the hair on the end of the tail on the end of the pig. He lol'd which I had never seen him do before - another rare event. You can use my joke at work no problems.

        • Apparently I'm almost mean. Slightly taller than mean, slightly lighter than mean, etc.

          Normalisation in statistics is where you basically rescale things onto a common scale (usually mean 0 and standard deviation 1): take the mean off the value and divide by the standard deviation. https://en.wikipedia.org/wiki/Normalization_(statistics). This is often used in things like educational assessments where you expect students to follow a normal distribution within the class.
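
          As a quick sketch of that "take the mean off and divide by the standard deviation" step, using only Python's built-in `statistics` module. The function name `standardise` and the marks are made up for illustration.

```python
from statistics import mean, stdev


def standardise(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample SD (n - 1 denominator); pstdev would divide by n
    return [(v - m) / s for v in values]


scores = [52, 61, 70, 79, 88]  # made-up class marks, for illustration only
z = standardise(scores)
print([round(x, 2) for x in z])
```

          The result has mean 0 and standard deviation 1. Note this only shifts and rescales the values, so skewed data stays skewed afterwards.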

          I've not heard hair on a pig's tail - usually I've heard of unlikely events referred to as "black swans", but I think that's especially around negative outcomes.

          • +1

            @moar bargains: I used to know all that, but studying stats is not like riding a bike.

            The piggy reference was an original.

            Black Swan.

  • I know this is a basic question for you but taken everything into account which model is better for predicting stock movements; a logit or probit model?

    • Not a basic question at all! As I've mentioned in other comments above, finance and economics is not an area that I've done anything professional in, so can't really give any advice. From my quick Google though one guy makes the claim that probit models are now common in economics, so I guess that's the one I'd suggest?
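
      For reference, the two models differ only in the function that maps the linear predictor to a probability. The notation below is an assumption for illustration: $y = 1$ might stand for "price goes up", $x$ for the predictors and $\beta$ for the coefficients.

```latex
% Logit: logistic CDF as the link
P(y = 1 \mid x) = \frac{1}{1 + e^{-x^{\top}\beta}}

% Probit: standard normal CDF as the link
P(y = 1 \mid x) = \Phi\!\left(x^{\top}\beta\right),
\qquad
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
```

      Both curves are S-shaped and tend to give very similar fitted probabilities. The logistic curve has slightly heavier tails, so the differences show up mostly at the extremes.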

  • +1

    Thank you for this AMA OP! I was wondering if you could please recommend some basic books or resources for the following:

    1. Statistics, probabilities and p-values in terms of experimental design

    2. Basics of machine learning

    In terms of background, I'm a biochemist by training but I don't use statistics much in my current research. I'd like to better understand the use of statistics and p-values in research as I attend a lot of seminars where this kind of data is presented. I know that statistics underpins machine learning but I'd love to be able to get a basic understanding of how it works. I guess entry level recommendations would be best? Thank you!

    • +1

      You're most welcome! Thanks for taking the time to ask a question :)

      I'm out of the office today, but will ask my colleagues if they have any suggestions of books or resources tomorrow. I think I have heard some youtube channels mentioned.

      In the meantime, you could look at Experimental Design for Biologists by David J. Glass (ISBN 9781621820413). And this might be a useful resource for getting a start on understanding Machine learning?

      Hope that helps!

      • +1

        Brilliant! Thanks very much for these recommendations, I'll definitely look into them. Any others from your colleagues would be most welcome. Much appreciated!

  • +1

    Hi, thanks for answering in advance. What’s your experience in the field and salary banding? (Ballpark will do) and which state are you in for comparison?

    • I'm about 6 years out of uni, so still relatively junior I guess. Have worked in food safety statistics research, government departments as a data analyst and software developer and now as a statistician for grains research. Based in Adelaide and earn between $70k and $80k plus 17% super.

      • Adelaide pay rates are rubbish though :)

        • +1

          Yeah could be better, but house prices make up for it IMO ;)

  • Hi all,
    Just wanted to take a moment and say a genuine thank you to all those who have asked questions or commented. I'm humbled by the amount of interest in lil' ole me and my job. There's honestly been far more than I ever expected!

    So thanks, and feel free to keep the questions coming :)

    • +2

      This may sound like a joke but it's a genuine question. There is this old bloke in our office shouting "all statistics proving global warming are fake and global warming itself is a hoax" etc. He reckons scientists make up stories out of thin air and statistics are also made up, hence they had to change the name from global warming to climate change now. Based on what you know, is he right? Apologies if this is an irrelevant topic. Thanks in advance.

      • +3

        Haha very interesting question. I honestly don't really know where I sit at the moment on this one. Here are my thoughts.

        There are a fair number of highly intelligent scientists who are absolutely convinced by climate change and its effects. I can't do all the analysis by myself, so I need to trust someone else at some point. However, the smart scientists also once thought the world was flat. So that by itself is not necessarily a definitive argument.

        On the flip side, there was an Australian guy who did a PhD auditing one of the major data sets that these analyses are built on. And he found some serious (like high-school level) errors. Now, as I say, I don't know enough about the meteorological analysis to do it all myself, but one thing I do know how to do is check data sets! And so I checked the data set for some of the errors that he had found, and sure enough found them myself. Things like snow-like temperatures in the Caribbean in mid summer. Misspellings of countries. Now these errors may not change the overall conclusion, however as a statistician, errors of that nature in a fundamental data set don't give me any confidence in the rest of the data. At the very least, it raises the question of why a PhD student found mistakes that a substantial government institute didn't.

        So as I say, I haven't really decided where I sit on that one yet. Still collecting data before I make my conclusion I guess ;)

        • +1

          Thanks mate, much appreciated.

        • to jump on the bandwagon …

          if your job is "climate change scientist" and your job is to look for signs of climate change, only to find that there are no signs of climate change and you've just wasted 4 years of your life and will no longer have a job, do you:
          a. come clean and say there's no problem and look for a different career (when your university degree is no longer worth anything)
          b. keep quiet and keep earning good money (while claiming how bad the climate is to justify your next grant)

          The earth has a really good system:
          - burning all of these fossil fuels makes the earth warmer
          - CO2 / pollution creates smog / smoke
          - smog blocks sun-rays (making it colder) while keeping in heat (keeping it warmer)
          - if it gets too cold, ice forms, trapping mostly CO2
          - if it gets too hot, ice melts, releasing CO2

          Sure, we should do our bit to reduce pollution, we should focus on sustainable and renewable energy sources and we should be leaving the world a better place than we inherited it …

          But, if you really want to save the world, start a war, kill a few million (billion) humans, less people means less animals and trees being killed, less pollution, less waste, fewer cars - save the world, kill a human!

          Now you get into the fun part, if you had to kill off half of the world's population, how would you choose which half? Should we kill all of the white people? How about all of the old people? All of the unemployed? All of the poor? All of the rich? All of the un-educated?

  • +1

    I misread the title as "I'm a Satanistic Data Scientist", and had to do a double take.

  • Is data scientist = lousy programmer + not too good statistician with basic maths skills?

    • Interesting question. I don't think your equation is accurate though.

      Speaking in vast generalisations Data Scientists probably wouldn't usually have the software development skills of a software engineer, but many of them are still pretty gun. They may not have the level of statistical skills of someone who would more readily identify themselves as a statistician, but again, many of them are pretty good in this area too.

      Some people have made the pithy observation that "A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician." There's no official definition, but this is about as good as any I've seen. But the lines are getting very blurry these days…

      There's a more detailed discussion at https://www.quora.com/What-is-the-difference-between-a-data-….

      • Isn't it the case that data scientists always rely heavily on libraries of algos developed by programmers, based on maths and methods devised by mathematicians and statisticians?

        And all they do all day long is cleaning, crunching, waiting, testing, rinse and repeat? What is so sexy about that?

        • There are some that fall into that camp, but the same could be said of a lot of practitioners though right? Your implication seems to be that Data scientists are somehow less qualified, useful or necessary because they only apply tools developed by others.

          All Doctors do is prescribe medication developed by pharmacists, so are they less useful? All engineers do is apply mathematics developed by mathematicians, so are they less qualified?

          From the article linked above:

          […] it often falls to data scientists to fashion their own tools and even conduct academic-style research. Yahoo, one of the firms that employed a group of data scientists early on, was instrumental in developing Hadoop. Facebook’s data team created the language Hive for programming Hadoop projects. Many other data scientists, especially at data-driven companies such as Google, Amazon, Microsoft, Walmart, eBay, LinkedIn, and Twitter, have added to and refined the tool kit.

          So data scientists can do a fair bit of novel work, just depends on their job and personal drive I guess.

          What is so sexy about that?

          I actually agree with you. I don't think there's anything particularly sexy about the job compared to any other. I guess the point of the original article is that data is undeniably in huge (and exponentially growing) demand in the 21st century so far. Hence, professionals with the skills to make use of that data are very attractive to companies. The name you give them is just semantics, and data scientist is the currently accepted term.

    • Another data scientist here, and my background is applied statistics.

      Honestly, the data scientist title is misleading in a way, and I don't like the title at all.

      If you read enough job descriptions, you will find most of them are asking for very different things, on a scale that stretches from hardcore programmer to statistician. A lot of employers have no idea what they are looking for, except "yep, I am after a data scientist" and that is all.

  • Benjamin Disraeli or Mark Twain?

    • Can't say I've read either, but I at least have something by Twain I think

      • whooosh

  • Hi,

    I'm working in medical science but am growing interested in combining data science with health/medicine and biology. I am OK with using R for statistics but that's about it. Haven't worked with "big data."
    Do you have any tips for people like me?

    Cheers

    • You sound like you're on the right track already. Do you have formal training in stats? If not, that's probably the main area you want to upskill. You don't necessarily need to work with big data, it's more about doing the right thing with the data you have, rather than just anything with a whole ton of data.

      Does that help? Feel free to PM or reply if you want to discuss more :)

  • +1

    Not too sure if this has been asked, but what are your tasks as a data scientist in relation to grains research?

    • Haha no, not been asked yet, despite all the questions for advice on how to do what I do!

      My job at the moment is partly teaching professional development workshops in statistics for experimental design and analysis. Another part is consulting on experimental design and analysis for grains trials. And the final main part at the moment is web development for a conference we are hosting, and the webpage for our group. Then there's other little bits and pieces including some statistical software development and personal research projects.

  • +1

    Do you feel you would make a good psychologist?
    I ask this as a psychology student currently, where I've noticed many of my peers struggle with the stats. First year is fine, but in second year and after that we seem to have some students who would make great psychologists, but they struggle immensely with 'stats'.
    I wonder if a statistics major would struggle with other parts of psychology and practicing as a clinical psychologist 🤔

    Perhaps kind of like a different parts of the brain thing, where it is difficult and/or rare to excel at both (both stats and maths involved, but also all other parts of the brain involved, as a psychologist).

    • +1

      Hm, great question. I've never really thought about becoming a psychologist, but I do find it fascinating. I'm not sure if I could hack being a clinical psychologist. I do love helping people, but I might just find it too depressing? At least wheat doesn't break up with its partner and contemplate self destructive behaviours ;)

      I know lots of psych students struggle with the stats, but I know a couple as well who were doing all my stats courses with me. You may well be onto something with the different parts of the brain, or perhaps a particular temperament is more attracted to one or other of psych or stats.

  • +1

    Does your institution have on-prem supercomputing and do you use it?

    What sort of mix between cloud (e.g. AWS) and on-prem do you use for your computing and what makes you choose one over the other?

    • Yes we do have on-prem, but we don't use it very often. Most of our HPC stuff we do on AWS, mainly for the control. Our on-prem is shared and hence has limits and schedules and things, which just makes it a bit of a pain to deal with. We may move more to on-prem in future. Not sure yet. There is also a shared cloud resource for Australian research institutions called NECTAR that we may look into using more.

      • +1

        Were you using eRSA and by on-prem are you talking about Phoenix :-) ?

        • … maybe… ;)

          You SA based?

  • Hey,

    Can you help me with a quiz I have for homework? I think I got the answers right but wanted to check. Can I PM you?

    • You can send it through. No guarantees though

      • No worries. It's just a few multichoice questions.

  • +1

    Do you have a blog? If yes, what's your workflow for blogging, GitHub pages, Medium, LinkedIn?

    • No, but I have been doing some web development recently for a conference. Pages are done in Hugo (some through the blogdown R package), hosted in a private GitHub repo which auto-deploys on Netlify. It's a really nifty system, and it makes static web pages really easy (as long as you can find a Hugo theme you don't want to change too much).

  • +1

    Do you use decision trees to choose between alternatives in your personal life? I had a mate over tonight who's just started dating a lovely girl and his ex found out about it, got jealous and is now trying to win him back.

    A slab of beer and a decision tree weren't helpful in solving his dilemma. So I suggested we toss a coin. I fear I may have given him the wrong advice. We should have stuck to the decision tree, shouldn't we?

    • The magic 8 ball will hold the answer.

      • $16! That's some expensive advice. At least mine were free :b

        P.S. All good, the missus saved the day, as usual :)

        • Haha great it all worked out.

          • @moar bargains: Yeah, I hit the jackpot - my missus is 3 standard deviations above the norm :)

            In case decision trees don't work for you either, this is what she did:
            When faced with 2 choices, simply toss a coin. It works not because it settles the question for you. But because in that brief moment when the coin is in the air YOU SUDDENLY KNOW WHAT YOU'RE HOPING FOR :)

            My mate is now single but only because he realised that he didn't feel strongly enough about the new gf and that he deserved better than the ex that cheated on him.
