{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Least squares fitting of models to data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a quick introduction to `statsmodels` for physical scientists (e.g. physicists, astronomers) or engineers.\n",
    "\n",
    "Why is this needed?\n",
    "\n",
    "Because most of `statsmodels` was written by statisticians and they use a different terminology and sometimes methods, making it hard to know which classes and functions are relevant and what their inputs and outputs mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
 
 
 
 
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assume you have data points with measurements `y` at positions `x` as well as measurement errors `y_err`.\n",
    "\n",
    "How can you use `statsmodels` to fit a straight line model to this data?\n",
    "\n",
    "For an extensive discussion see [Hogg et al. (2010), \"Data analysis recipes: Fitting a model to data\"](https://arxiv.org/abs/1008.4686) ... we'll use the example data given by them in Table 1.\n",
    "\n",
    "So the model is `f(x) = a * x + b` and on Figure 1 they print the result we want to reproduce ... the best-fit parameter and the parameter errors for a \"standard weighted least-squares fit\" for this data are:\n",
    "* `a = 2.24 +- 0.11`\n",
    "* `b = 34 +- 18`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
 
 
 
 
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_nnnnnnn/2751177292.py:28: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\\s+'`` instead\n",
      "  data = pd.read_csv(StringIO(data), delim_whitespace=True).astype(float)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x</th>\n",
       "      <th>y</th>\n",
       "      <th>y_err</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>201.0</td>\n",
       "      <td>592.0</td>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>244.0</td>\n",
       "      <td>401.0</td>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>47.0</td>\n",
       "      <td>583.0</td>\n",
       "      <td>38.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>287.0</td>\n",
       "      <td>402.0</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>203.0</td>\n",
       "      <td>495.0</td>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       x      y  y_err\n",
       "0  201.0  592.0   61.0\n",
       "1  244.0  401.0   25.0\n",
       "2   47.0  583.0   38.0\n",
       "3  287.0  402.0   15.0\n",
       "4  203.0  495.0   21.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = \"\"\"\n",
    "  x   y y_err\n",
    "201 592    61\n",
    "244 401    25\n",
    " 47 583    38\n",
    "287 402    15\n",
    "203 495    21\n",
    " 58 173    15\n",
    "210 479    27\n",
    "202 504    14\n",
    "198 510    30\n",
    "158 416    16\n",
    "165 393    14\n",
    "201 442    25\n",
    "157 317    52\n",
    "131 311    16\n",
    "166 400    34\n",
    "160 337    31\n",
    "186 423    42\n",
    "125 334    26\n",
    "218 533    16\n",
    "146 344    22\n",
    "\"\"\"\n",
    "try:\n",
    "    from StringIO import StringIO\n",
    "except ImportError:\n",
    "    from io import StringIO\n",
    "data = pd.read_csv(StringIO(data), delim_whitespace=True).astype(float)\n",
    "\n",
    "# Note: for the results we compare with the paper here, they drop the first four points\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fit a straight line use the weighted least squares class [WLS](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.WLS.html) ... the parameters are called:\n",
    "* `exog` = `sm.add_constant(x)`\n",
    "* `endog` = `y`\n",
    "* `weights` = `1 / sqrt(y_err)`\n",
    "\n",
    "Note that `exog` must be a 2-dimensional array with `x` as a column and an extra column of ones. Adding this column of ones means you want to fit the model `y = a * x + b`, leaving it off means you want to fit the model `y = a * x`.\n",
    "\n",
    "And you have to use the option `cov_type='fixed scale'` to tell `statsmodels` that you really have measurement errors with an absolute scale. If you do not, `statsmodels` will treat the weights as relative weights between the data points and internally re-scale them so that the best-fit model will have `chi**2 / ndf = 1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
 
 
 
 
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            WLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                      y   R-squared:                       0.400\n",
      "Model:                            WLS   Adj. R-squared:                  0.367\n",
      "Method:                 Least Squares   F-statistic:                     193.5\n",
      "Date:                Fri, 20 Mar 2026   Prob (F-statistic):           4.52e-11\n",
      "Time:                        11:18:46   Log-Likelihood:                -119.06\n",
      "No. Observations:                  20   AIC:                             242.1\n",
      "Df Residuals:                      18   BIC:                             244.1\n",
      "Df Model:                           1                                         \n",
      "Covariance Type:          fixed scale                                         \n",
      "==============================================================================\n",
      "                 coef    std err          z      P>|z|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "const        213.2735     14.394     14.817      0.000     185.062     241.485\n",
      "x              1.0767      0.077     13.910      0.000       0.925       1.228\n",
      "==============================================================================\n",
      "Omnibus:                        0.943   Durbin-Watson:                   2.901\n",
      "Prob(Omnibus):                  0.624   Jarque-Bera (JB):                0.181\n",
      "Skew:                          -0.205   Prob(JB):                        0.914\n",
      "Kurtosis:                       3.220   Cond. No.                         575.\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors are based on fixed scale\n"
     ]
    }
   ],
   "source": [
    "exog = sm.add_constant(data[\"x\"])\n",
    "endog = data[\"y\"]\n",
    "weights = 1.0 / (data[\"y_err\"] ** 2)\n",
    "wls = sm.WLS(endog, exog, weights)\n",
    "results = wls.fit(cov_type=\"fixed scale\")\n",
    "print(results.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check against scipy.optimize.curve_fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
 
 
 
 
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a =      1.077 +-      0.077\n",
      "b =    213.273 +-     14.394\n"
     ]
    }
   ],
   "source": [
    "# You can use `scipy.optimize.curve_fit` to get the best-fit parameters and parameter errors.\n",
    "from scipy.optimize import curve_fit\n",
    "\n",
    "\n",
    "def f(x, a, b):\n",
    "    return a * x + b\n",
    "\n",
    "\n",
    "xdata = data[\"x\"]\n",
    "ydata = data[\"y\"]\n",
    "p0 = [0, 0]  # initial parameter estimate\n",
    "sigma = data[\"y_err\"]\n",
    "popt, pcov = curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma=True)\n",
    "perr = np.sqrt(np.diag(pcov))\n",
    "print(\"a = {0:10.3f} +- {1:10.3f}\".format(popt[0], perr[0]))\n",
    "print(\"b = {0:10.3f} +- {1:10.3f}\".format(popt[1], perr[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check against self-written cost function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
 
 
 
 
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a =      1.077\n",
      "b =    213.274\n"
     ]
    }
   ],
   "source": [
    "# You can also use `scipy.optimize.minimize` and write your own cost function.\n",
    "# This does not give you the parameter errors though ... you'd have\n",
    "# to estimate the HESSE matrix separately ...\n",
    "from scipy.optimize import minimize\n",
    "\n",
    "\n",
    "def chi2(pars):\n",
    "    \"\"\"Cost function.\"\"\"\n",
    "    y_model = pars[0] * data[\"x\"] + pars[1]\n",
    "    chi = (data[\"y\"] - y_model) / data[\"y_err\"]\n",
    "    return np.sum(chi ** 2)\n",
    "\n",
    "\n",
    "result = minimize(fun=chi2, x0=[0, 0])\n",
    "popt = result.x\n",
    "print(\"a = {0:10.3f}\".format(popt[0]))\n",
    "print(\"b = {0:10.3f}\".format(popt[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Non-linear models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
 
 
 
 
    },
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# TODO: we could use the examples from here:\n",
    "# http://probfit.readthedocs.org/en/latest/api.html#probfit.costfunc.Chi2Regression"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
