[AMA] I'm a Statistician/Data Scientist, Ask Me Anything

Hi all,

I'm a statistician/data scientist (apparently the sexiest profession of the 21st century) currently working for a university in grains research. Previously I've worked as a statistician in food safety research, and as a software developer.

Been around ozbargain a while, but never done one of these.

Will do my best to answer so ask away!

P.S. Yes, I know 73.2% of statistics are made up. That's not my job, that's called a politician.

closed Comments

  • +3

    Is the article glamorising the job, or is it really like that? For example, is your job to be a renegade oddball who enters a corporation, sits in the oddball corner, and tests and implements mathematical theories that challenge the status quo and bring about revolutionary change to an industry?

    • +1

      It probably depends on the industry/company, but my experience (and, I suspect, that of most in the profession) is nowhere near as glamorous. It can often be very isolated, searching for needles in haystacks.

      • +14

        It probably

        Should have led with "In all probability," instead :)

      • How's researching grains sexy? What are you doing with them?

        • You might be surprised what they get up to!

  • +2

    If I am an engineer with basic programming skills, what should I do to become a data scientist, besides having an interest in it? Should I get qualified via a PhD or Masters in Data Science, a Software Engineering degree, a Maths PhD, etc.?

    Basically, how do I get started?

    • +3

      Good question. I think it depends a bit what you're aiming for because it's a huge field. I wouldn't dive straight into a PhD though.

      If you are looking for jobs in things like machine/deep learning, then you probably have most of the skills already, so the key thing you probably want to look at learning is when to apply different techniques/tools: statistical literacy, essentially. There are a huge number of online resources in this area, so it wouldn't be very difficult to learn (some of) the skills online. Things like Kaggle, DataCamp, some others I can't remember at the moment…

      If you want to work in experimental design, clinical trials or things like that, then you probably need to look at a Masters/PhD in (bio)statistics.

      Software engineering though is not data science, and certainly not statistics. They all overlap to some greater or lesser degree, but they are not identical.

      Does that help?

      Edit: And you'll definitely want to learn R and/or Python.

      • I already work as a quant analyst in finance, so I know R/Python. However, data scientist ads have a long list of frameworks, each of which seems to have a decent learning curve: Hadoop, Apache Spark, PySpark, Kafka, plus various cloud technologies. Which frameworks are the most popular in the industry?

        • +1

          I'm not really in that scene, so can't really comment. Familiarity with Linux goes a long way (all the cloud platforms usually run Linux), and then I'd suggest just picking a couple and having a go. E.g. see if you can play around with some Python libraries. It's impossible to be an expert in all the technologies/frameworks, so my approach is to pick a couple you can do well, and use that to prove you could do the others if needed.

        • +6

          Which frameworks are the most popular in the industry?

          You haven't already scraped all the job ads and extracted the answer from the data? ;)

        • Working as a quant analyst sounds really interesting. Do you have any advice for someone who has an actuarial background and wants to move into a very technical position such as quant analyst? A lot of position descriptions have a PhD as a requirement; do you think that's accurate?

          • @Happy101: If you have an actuarial degree, you'll have all the technical knowledge required. The only thing you lack is domain knowledge. If you work in a large company, look to transfer to a quant risk division. The only small companies hiring quants are trading firms, so you'll need to see what they require for entry-level quant positions and focus on accumulating domain knowledge in that.

            Personally, I think the traditional finance industry is stagnant which means job creation and turnover is rare, which also means career progression is very difficult. Data science is still very much in its infancy in that the problems businesses are facing are implementation (how/what/create systems to use) rather than innovation (developing new models/techniques). Furthermore, once these systems are in place and workflow established, they're going to need people to maintain existing systems and start investing in innovation to maintain a competitive edge.

        • I would say 95% of data scientists don't know how to use Spark, let alone Kafka. I always wish my data scientist colleagues would use at least PySpark (I can't understand why Pandas is so popular when the Spark DataFrame API is so much cleaner, but if you insist on Pandas there's Koalas for Spark :)). I find most of the time DS just use either gradient boosting, or deep learning in the case of NLP or video/images. An effective DS knows which features are important for the business problem and the type of data. We build demand forecast models, but FMCG products are different from slow-moving products.

      • Piggybacking off this comment, how would someone working in the actuarial field move over to data science? Are languages like SAS relevant?

        • Never heard of SAS used in data science, although I'm sure SAS will tell you otherwise! You're better off learning Python or R, and trying to learn some frameworks related to where you're aiming (ML, AI, HPC, etc).

          • +1

            @moar bargains: Thanks for your response! And I thought as much. For some reason, SAS doesn't seem to be used at all in data science. Guess I'll have to spend time working on side projects in Python.

            • +1

              @hehlo: I think understanding SAS as a proprietary product should explain why it doesn't appear to be used. It definitely is, by some of the biggest and most important clients in Australia (and the world), it's seriously no joke.

              It's just quite hefty and quite expensive to get started with, and is geared towards high performance, stuff that is simply impossible to do with Python (because you can't get enough speed).

              At uni you won't be using terabytes and petabytes of data that need extremely fast processing and response times within seconds. And you won't be processing millions of transactions per minute on a side project either… so it's a matter of perspective!

              It's easy to get into data science with open source, but there's a big reason why there are so many heavyweights like Oracle, SAP, IBM and Microsoft still kicking and making tons of money off data science and analytics.

              • @andgucps: I understand that SAS is really expensive to get started with (especially those training courses they offer). But don't other languages require licensing fees as well? I know RStudio requires a license for commercial purposes, but I'm not too sure about numpy and other packages for Python.

                • @hehlo: RStudio doesn't require a licence unless there are a couple of particular features you want to use, or you want the support, which some enterprises do. Python and its libraries don't have any licence fees as far as I know (but that's not very far).

              • @andgucps: Yes, this is a fair comment. I know SAS is widely used, in large enterprises (e.g. ABS). And one of my previous jobs was using an SAP in-memory database, so as you say the big players are still in the game.

  • +1

    How do you feel about Simpson claiming ownership of the Simpson paradox despite the fact it was first noted by Yule in 1903?

    • +1

      An appalling act of academic thievery!

      To be honest I'd never heard of it, and it really has no impact on me, so not bothered.

  • AMA

    What are the odds that people really understand how statistics can actually be misused?

    • +1

      What are the odds that people really understand how statistics can actually be misused?

      I'm not sure which way you're going with this one. Do you mean a) people understand and deliberately misuse stats, or b) people don't understand and unintentionally misuse stats?

      Either way, misuse of stats is rife, and a huge issue IMO. I am probably still too young and optimistic, but I don't think the majority is done maliciously or deliberately (with the possible exception of in politics and some journalism, but journos are improving). Probability is not something humans are naturally very good at, and statistical literacy among the general population is very poor. Statistics are absolutely central and critical to science, but very often it is poorly understood. Many people (even many scientists) seem to have an aversion to mathematics, and even more so to statistics.

      It's a bit like websites. It's not actually that hard to make a website if you've got a bit of computer literacy, but to do it well is a different story. Hence there are websites that look terrible, and store passwords very insecurely for example. Just because you can do it yourself, doesn't mean you shouldn't consult a professional if it's important.

      • +5

        What percentage of the population realise stats misuse is a real problem that affects their daily lives?

        • Unfortunately, very few…

          • @moar bargains:

            What percentage
            very few

            hmmm.

            • @MrBear: Ok, a very small percentage. I'll generously estimate 1%.

              • @moar bargains: Given the understanding of statistics I see among tertiary students in statistics courses, I'd estimate the true proportion in the entire population would be less than .05

                • @ozbjunkie: Yes, that's probably even more realistic, probably still generous. That still means ~1.2M people in Australia have a good understanding.

                  What do you teach?

                  • +1

                    @moar bargains: Jeebus. You want to have another go at calculating half a percent of 24M?

                    • @orly: Hmm… Where'd that extra zero come from 🤔
                      Good pick up, thanks! Good thing I'm not in charge of payroll!
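
The slip in this exchange comes down to whether ".05" is read as a proportion or as a percentage. A minimal sketch of the arithmetic, taking the 24M population figure quoted in the thread as given; the variable names are illustrative only:

```python
# Population figure as quoted in the thread (an approximation).
population = 24_000_000

# Three readings of the estimates discussed above.
readings = {
    "5% (0.05 as a proportion)": 0.05,
    "1% (the original estimate)": 0.01,
    "0.5% (half a percent)": 0.005,
}

# Convert each share of the population into a head count.
counts = {label: round(population * share) for label, share in readings.items()}

for label, count in counts.items():
    print(f"{label}: {count:,} people")
```

Each reading differs from the next by a factor of between two and ten, which is why it helps to state whether a figure is a proportion or a percentage.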

                      • +1

                        @moar bargains: I think it's reasonably common to know that stats are abused, even if people can't point to specific examples and explain the errors. Surely more than 1% of people get jokes like "73% of all statistics are made up on the spot" and "There are three kinds of lies: lies, damned lies, and statistics."

                        • @abb: Yeah fair call. It probably is fairly common to understand stats are misused and abused, but I guess what I was meaning was people who understand what the problems are and why, rather than just "you can't trust stats because stats".

              • @moar bargains: 60% of the time, every time.

  • +5

    So hot right now… question is, how do you store all the ladies phone numbers? You must have so many.

    • +6

      all the ladies phone numbers?

      Don't forget the blokes. Statisticians love to diversify their data sex sets.

    • +3

      Well, dealing with big data (like all the phone numbers you get given) is part of the skill set ;)

  • What qualifications do I need to become a statistician?

    • PhD or Masters by research in stats?

    • Again, as above it depends a bit on your end goal.

      Most people though will do a Bachelor of Maths or Science majoring in stats (often with Honours or Masters). Some will retrain via a Masters or PhD after a first degree in a different field.

      If you want to work in research (and progress) though, it almost always requires a PhD eventually.

  • Tertiary qualification?

    These days there are "Data Science" degrees at unis, but a few years ago it wasn't so.

    • +1

      Tertiary qualification?

      Yes for me, and almost invariably so in the profession. There's a fair bit of high level maths involved.

      These days there are "Data Science" degrees at unis, but a few years ago it wasn't so.

      Yes, and I'm still a bit sceptical of them. When I graduated in the early 2010s, there weren't any around. Now just about every uni has one. I guess what I'm concerned about is that unis are just trying to cash in on a trend, and not teaching the required skills properly?

      • +2

        I currently work at an Australian university, teaching maths, and I can confirm from talking to my statistician colleagues that the data science courses are definitely half-baked ideas with no real thought put into the design of the degree. I can't speak for all tertiary institutions, but I'm definitely not convinced by what's offered where I work.

        • I can't say I'm surprised unfortunately, and I'm sure it's not just your institution.

      • As in, what tertiary degree did you do? I assumed it was a given that you did one.

        • Ah, sorry. See here. Bachelor of Mathematical and Computer Sciences majoring in statistics from University of Adelaide.

  • Are you the quant? E.g. the Jenga pitch in The Big Short

    • No, never worked in finance, and don't really plan to.

  • Can you become the next Cambridge Analytica?
    but for good purposes? :)

    • +3

      Haha I wish! At the moment I'm helping improve the yield of Australian grain, which will help us feed the world, so I guess that's a good thing?

  • +1

    What are the winning lotto numbers going to be for the next big draw?

    • +15

      23, 14, 42, 45, 13, 4 with probability 0.00000012.

      One of the first things you learn in stats is not to gamble…
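
For comparison, the chance that one specific set of numbers comes up can be computed directly. A minimal sketch, assuming a draw of 6 numbers from a pool of 45 with order ignored (an assumption; the thread does not say which game is meant):

```python
from math import comb

# Assumed game format: 6 numbers drawn from 45, order ignored.
numbers_in_pool = 45
numbers_drawn = 6

# Number of distinct tickets, each equally likely to be the winning one.
combinations = comb(numbers_in_pool, numbers_drawn)
probability = 1 / combinations

print(f"{combinations:,} possible tickets")
print(f"P(one specific ticket wins) = {probability:.8f}")
```

Under that assumption every ticket has the same chance, so picking "common" numbers does not change the odds of winning.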

      • Related, with American figures it seems: https://math.stackexchange.com/questions/1604164/how-to-calc…

        • +1

          Of course like all statisticians you should qualify your response.

          1. What amount defines a big draw?
          2. The next big draw in which country?

          • @RockyRaccoon: True, but the formula in the link above accounts for these, and will give an expected result once they're defined.

      • 100% running these numbers tonight ~ those odds are way higher than what lotto actually is.

        $1.30 1 game - locked in.

        @remind me ~ in 24 hours - will report my win/loss

        • Haha I wouldn't actually put any weight on my values above. I have no idea how lotteries work really, so I googled it and these were the most common numbers for the last some number of draws. And then I had a guess at how likely it was that these numbers would come up and calculated the probability of that.

          • +1

            @moar bargains: better than anything else lol

            ~ but would be cool ozbargain story to tell

            AMA - I won the lotto off numbers supplied from another AMA from a stats guy! - Ask me anything!

  • Where will AI lead us?
    As a computer scientist working on a unique approach, my opinion is by current models, only predictable directions. Still that's very useful and dangerous.
    My approach is emulating quantum and theoretical sub-quantum states in a software based translation of our perceived reality.
    However, as you are entrenched in organizational research and applications, and involved with many others, I'd like your opinion of where we may end up in 20 or 50 years with AI development.

    • +1

      I suspect that humans will not be able to comprehend the technological singularity. It may even go unnoticed.

      • Good points. We probably need to explore reality more deeply, discard our 5 to 7 sense perception, and as even the myriad of anomalies determine causality clearly, AI will need to be built on a deterministic interpreter model (Example see witchcraft/healing), rather than evolved animal hallucination of the field-based illusion leading to spacetime as the real, complete and source of the effect we experience (sorry lost track, lack of sugar maybe).
        Anyway, until we build an interpreter between reality states, my opinion is AI is as harmless as any other weapons the bad guys build.

    • +1

      That's a very interesting question, and to be honest you are probably much more qualified to share your opinion than I am!

      I agree with you though. I don't think that we will see the apocalyptic results that Hollywood seems to love. I think there is the danger of some substantially negative outcomes though. Someone above mentioned Cambridge Analytica. Between this and other scandals/revelations, I think we will see regulatory push-back on "algorithms" running things over the coming decade or two? Not sure. Will have to give it some more thought and investigation.

  • What is the likelihood of developing Quadrotriticale ?

    • I had never heard of this, but given it's from a sci-fi show I'm gonna say low ;)

      I don't know enough about the biology to know if such a thing is even possible, but I would hazard a guess that it's not.

  • Are you using AI, in particular neural networks, to analyse protein folding?

    And if so do you rent GPU cloud time from Google?

    • We don't analyse protein folding. I think that's more proteomics stuff? We have done some very simple NN stuff relating to image classification. Most of the big stuff we do is genetics data.

      We usually use Amazon web services as our cloud provider. Have used Google (and MS Azure) in the past, but mainly as a trial.

  • +5

    What's the difference between a data analyst, data engineer, data scientist, data architect and data evangelist? (Nowadays, almost everyone is a bloody architect or evangelist on LinkedIn.)

    • +5

      This is a good question, and one I expected to turn up earlier!

      I have no idea what a data evangelist is, but to me it sounds like a salesman: "use our data product!".

      There is quite a bit of overlap between the others IMO. In general, I would see a data analyst as less qualified (have previously worked as a data analyst and didn't need a stats degree). It tends to be more "interpretation of information" rather than actual statistical analysis based on models and probabilities. They are probably also more general? I.e. you'll find a data analyst working in everything from stock trading to foreign intelligence.

      A data scientist is often synonymous with a statistician, although a DS probably tends to deal with larger data sets and things like machine learning more than a statistician. The major difference is perhaps their approach to problems? A statistician tends to think in terms of mathematical models, whereas a DS would tend to think in terms of tools (often code or software) to apply to a problem I think?

      A data engineer and data architect are probably more found in very large enterprises. I would see a data architect as designing the systems needed for a company. E.g. databases, where they are hosted, what systems need to interact with them etc. And a data engineer is probably more around implementation of those systems.

      Those are my opinions of the distinctions, however there are no defined criteria, so many people use them interchangeably.

      • Great, thanks. What tools/languages do you use to clean or filter data before it can be used for statistical analysis, or do you leave this to a data analyst to give you pure and clean data?

        • +1

          pure and clean data?

          Ha! A fair chunk of many statisticians' jobs is data cleaning. The majority of what I've encountered professionally to date is hand entered into Excel spreadsheets, so that leads to all sorts of problems… I personally use R, but there are others who use Python, and even Excel can be a useful tool for some things. Depends really what you do and where the data is sourced from.

          • +1

            @moar bargains: Then follow up question…what % of your time goes into cleansing data, getting/fixing source, analyzing data vs % of time you spend on real statistical analysis aka data science?

            • @pyramid: At the moment it's pretty negligible, maybe 5% of my time or less goes into cleaning data (getting, fixing, writing the code etc). But that's because it's not a huge part of my role at the moment. So probably only another 10-15% for actual statistical analysis. In the past it's been up around 50% (cleaning) at times, like when I was doing summer holidays stats work while at uni.

                • @moar bargains: Thanks. Do you use Spark and its MLlib for model training and analysis?

                • @pyramid: Nah, can't say I have. Do you work in the field?

                  • +2

                    @moar bargains: I'm not really in the field of data science or statistics, just doing personal learning around Spark/Python etc. I was having a conversation with a friend of mine about Spark and its use cases, but we don't know if anyone here is really using Spark and its libraries for data analysis or data science purposes.

            • @pyramid: Cleansing is the job of the data entry person. Identifying the issue through quality assurance is the job of the data analyst through a reporting mechanism. Any cleansing needs to be highlighted not hidden in a transformed and coded dataset, operationally speaking.

              When you’re migrating a dataset then yeah, cleanse the balls off it and drop it in as correct as possible.

              • @zqipz: Sounds like you work as a database dev or admin, or business analyst. What you've described is how it should work for regular reporting in a business, but for statisticians that is rarely the reality. Many statisticians work in research, where the data often doesn't come out of nicely formatted databases. Those that work in business are usually developing new reports or trying to answer new questions, so often the data comes from disparate sources.

                Any cleansing needs to be highlighted not hidden in a transformed and coded dataset, operationally speaking.

                Absolutely agree. This is one reason why I (and many statisticians) use R for data cleaning rather than Excel. It's a scripting language, so you have a record of all the changes you have made, and can then proceed from there.
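
The point about scripting applies in any language, not just R. A minimal sketch in Python of a cleaning step that records every change it makes; the column names, sample rows, and cleaning rules are all hypothetical:

```python
import csv
import io

# Hypothetical hand-entered data, as it might arrive from a spreadsheet export.
raw = io.StringIO(
    "plot,variety,yield_t_ha\n"
    "1, Wheat ,3.2\n"
    "2,wheat,n/a\n"
    "3,BARLEY,2.9\n"
)

cleaned, change_log = [], []
# start=2 because line 1 of the file is the header row.
for line_no, row in enumerate(csv.DictReader(raw), start=2):
    # Normalise variety names and log the change if anything was altered.
    variety = row["variety"].strip().lower()
    if variety != row["variety"]:
        change_log.append(f"line {line_no}: variety {row['variety']!r} -> {variety!r}")
    # Convert yields to numbers; record anything that cannot be parsed as missing.
    try:
        yield_value = float(row["yield_t_ha"])
    except ValueError:
        change_log.append(f"line {line_no}: yield {row['yield_t_ha']!r} -> missing")
        yield_value = None
    cleaned.append({"plot": int(row["plot"]), "variety": variety, "yield_t_ha": yield_value})

print("\n".join(change_log))
```

Because the raw file is never edited by hand, the script plus its log is the record of what was changed and why.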

                • @moar bargains: Business Intelligence and Analytics. Many situations require many different approaches. Those examples aren’t definitive of why you’d cleanse.

  • +1

    What is the standard deviation of questions in this thread that are off topic?

    • +8

      Well, if we give a value 1 to questions on topic, and 0 otherwise, we get…

      Hold on, have I just been nerd sniped?! https://xkcd.com/356/
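
The 1/0 coding suggested here makes each question a Bernoulli variable, so the standard deviation depends only on the proportion p of on-topic questions: sqrt(p(1 - p)). A minimal sketch with made-up labels (the counts below are hypothetical, not a tally of this thread):

```python
from math import isclose, sqrt
from statistics import pstdev

# Hypothetical labels: 1 = on topic, 0 = off topic.
labels = [1] * 30 + [0] * 10

p = sum(labels) / len(labels)    # proportion of questions on topic
sd_formula = sqrt(p * (1 - p))   # Bernoulli standard deviation
sd_direct = pstdev(labels)       # population SD computed from the labels

print(f"p = {p:.2f}, sd = {sd_formula:.3f}")
assert isclose(sd_formula, sd_direct)
```

The standard deviation is largest (0.5) when exactly half the questions are off topic and shrinks towards zero as the thread becomes all on topic or all off topic.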

  • Do you take your analysis with a grain…

    • Haha a grain of grain?

      "All models are wrong, but some are useful" - George Box, a statistician.

  • What is the likelihood that humanity will reach a utopian/dystopian state in the next 100 years?

    • 0.4023 with a standard deviation of 0.586

      In all seriousness though, this question is deep and all I can give you on this is my personal opinion. Problems like this are actually really difficult to predict/evaluate, due to the number of factors and variables involved, and I'm too lazy for that!

      I personally don't believe we will ever reach a utopian state on earth, because humanity. That said, I'm not sure we will ever reach a completely dystopian state either. I think we will exist in a range, without reaching either end-point, but constantly moving one way or the other.

      • I should clarify, I am referring to utopia/dystopia for humanity as a whole on earth. I am legitimately concerned that there will be pockets of dystopia in various locations and forms. I guess it depends how big a system you account for.

  • +1

    Thoughts on https://www.ozbargain.com.au/node/476642 ?

    Do you bother/"waste" money gambling on lotto/lotteries personally? If you don't, do you pity/look down on those that do?

    What software/hardware do you use and how much data do you analyse typically in your role?

    • +1

      See above. I sit in the bottom poll option (not worthwhile) of the post you linked. I think I've only ever put $1 in a pokie machine, when I had recently turned 18.

      Lotteries, casinos, pokies and gambling in general are fascinating from a statistical and psychological point of view though. They are a perfect example of the fact that many people have very little understanding of probability. Even if the odds are only very slightly in the favour of the house, applied over a large enough group they always come out ahead.
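
The last point can be made concrete with a normal approximation. A minimal sketch, assuming independent even-money bets of one unit that the player wins with probability 0.49 (an illustrative edge, not a figure for any real game):

```python
from math import sqrt
from statistics import NormalDist

p_player_wins = 0.49               # assumed; gives the house a 2% edge
edge = 1 - 2 * p_player_wins       # expected house profit per unit bet
sd_per_bet = sqrt(1 - edge ** 2)   # SD of a +1/-1 outcome with that mean


def prob_house_behind(n_bets: int) -> float:
    """Approximate chance the house has a net loss after n_bets independent bets."""
    total = NormalDist(mu=n_bets * edge, sigma=sd_per_bet * sqrt(n_bets))
    return total.cdf(0)


for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} bets: P(house behind) ~ {prob_house_behind(n):.4f}")
```

The expected profit grows in proportion to the number of bets while the spread grows only with its square root, so a small edge becomes close to a certainty at scale.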

      • What about sweepstakes? Personally I can't be bothered doing all the clicking & providing information that may result in getting spammed, what are your thoughts?

        • I guess the key thing to consider with all these things is who is running this, and what are they getting out of it? If it's a for-profit company, you can be sure they are running it with odds that will benefit them (i.e. they'll make more money, potentially through marketing to you in future, than they spend on promotion and prizes).

          I used to enter a lot of competitions and things, but like you, I'm a lot more cautious now with providing personal information.

      • did you win anything from that dollar?

        • I honestly don't remember but I don't think so.

    • +2

      Just saw the edit: I don't pity those who do buy lottery tickets any more than I pity those who buy cigarettes. I personally don't do either, as I think both are unwise choices, but people are free to make their own choices. I would like to think I don't look down on them either, but perhaps that needs some more self-reflection.

      I think a lot of people would benefit from a little more statistical literacy and understanding, but I know it's a difficult task. Part of my current job is to run professional development courses for researchers to improve their statistical literacy and skills, and there's plenty enough work there for me! I think with lottery tickets it's also more complicated than I've made out. If someone is spending a few dollars a month on a lottery ticket because it's a "treat" and they like to dream about the winnings, I think that sort of optimism and hope is an intangible benefit very hard to quantify. Similar to someone who would go and buy a pastry from a cafe once a week or something. It's not really going to benefit them long term, but life sucks sometimes, and it's nice to have little things that can pick you up.

      Sorry, what started as one sentence has become an essay… Great question, thanks!

      Some of the software tools I use in my job (apart from standards like browser and MS office):

      • R & RStudio
      • Shiny
      • Python
      • Git
      • LaTeX
      • Hugo (Been doing some web development lately)

      I run a spec'd up Dell laptop (1TB SSD, 32GB RAM). It's not actually all that special by the standards of what's around today, but it's by far the best computer an employer has ever given me.

      how much data do you analyse typically in your role

      At the moment, it's not much in terms of volume. I have previously worked on spreadsheets of 400K+ rows, but that's not even that big in the scheme of things. I have also worked on some bigger datasets for a government department looking at fraud detection. Mostly these days it's a few hundred rows (A few megabytes tops).

    • +1

      I'll jump in here as someone who's studied a bit of psychology …

      They did a test with pigeons, had a button and taught the pigeon to press the button to get food …
      They changed it so that the button would only give food after 5 presses, then turned it off and monitored the time / duration before the pigeon gave up …
      They then changed it so that it would only give food after a 30 second interval and monitored the time it took after switching off …
      They then did random, sporadic rewards and monitored the pigeon continue pressing the button until it died of starvation …

      Psychologists are sick b*stards

      • Haha this is the sort of thing I love about Psych. What does that even prove or demonstrate?! And who was it that gets to say "my PhD was in pigeons pressing buttons"?!

  • +2

    I'm on the board of a small charity. We need a statistician to help work out the cost of having preventative treatment vs not having it and requiring further surgeries in the future. Do you know of any statistician charities etc. that might be able to lend us their services?
