FLM does not hook online after power outage

by gebhardm on Wed, 21/11/2012 - 18:14

Today (again) we had a power outage from 14:57 until 15:50; while the SolarLogger went up (so I could detect the interval of power not available), the FLMs (two) did not get into the internet again:

Jan 1 00:02:07 flukso-c13f30 user.err heartbeat[832]: POST "https://api.flukso.net/device/c13f3032d717d096065d8cf524116524": -2
Jan 1 00:02:07 flukso-c13f30 user.err heartbeat[832]: Name or service not known
Jan 1 00:02:07 flukso-c13f30 user.err heartbeat[834]: POST "https://api.flukso.net/device/c13f3032d717d096065d8cf524116524": -2
Jan 1 00:02:07 flukso-c13f30 user.err heartbeat[834]: Name or service not known
...
Jan 1 01:01:02 flukso-c13f30 user.info heartbeat[963]: POST "https://api.flukso.net/device/c13f3032d717d096065d8cf524116524": 200
Jan 1 01:15:01 flukso-c13f30 cron.info crond[770]: crond: USER root pid 1051 cmd [ $(/usr/bin/fcheck; echo $?) -ne 0 ] && reboot
Jan 1 01:30:01 flukso-c13f30 cron.info crond[770]: crond: USER root pid 1060 cmd [ $(/usr/bin/fcheck; echo $?) -ne 0 ] && reboot
Jan 1 01:45:01 flukso-c13f30 cron.info crond[770]: crond: USER root pid 1138 cmd [ $(/usr/bin/fcheck; echo $?) -ne 0 ] && reboot
Jan 1 02:00:01 flukso-c13f30 cron.info crond[770]: crond: USER root pid 1147 cmd [ $(/usr/bin/fcheck; echo $?) -ne 0 ] && reboot
Jan 1 02:01:01 flukso-c13f30 cron.info crond[770]: crond: USER root pid 1156 cmd /usr/bin/heartbeat 0
Jan 1 02:01:02 flukso-c13f30 daemon.info dnsmasq[809]: reading /tmp/resolv.conf.auto
Jan 1 02:01:02 flukso-c13f30 daemon.info dnsmasq[809]: using nameserver 82.212.62.62#53
Jan 1 02:01:02 flukso-c13f30 daemon.info dnsmasq[809]: using nameserver 78.42.43.62#53
Jan 1 02:01:02 flukso-c13f30 daemon.info dnsmasq[809]: using local addresses only for domain lan
Jan 1 02:01:02 flukso-c13f30 user.info heartbeat[1157]: POST "https://api.flukso.net/device/c13f3032d717d096065d8cf524116524": 200
Jan 1 02:15:01 flukso-c13f30 cron.info crond[770]: crond: USER root pid 1239 cmd [ $(/usr/bin/fcheck; echo $?) -ne 0 ] && reboot
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: POST "https://api.flukso.net/sensor/f628158301152d75538cef50736f338a": -150
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: POST "https://api.flukso.net/sensor/d6b662b10579091cbb5f39170c678f5e": -150
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: POST "https://api.flukso.net/sensor/b6ae3286419bf81e3a6fc1d192666286": -150
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: POST "https://api.flukso.net/sensor/0ebaec40f9e80ff39534d745d8e0fd4d": -150
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: POST "https://api.flukso.net/sensor/97eb5fb3198888e1c21ddf7086361269": -150
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: POST "https://api.flukso.net/sensor/413509b4c041f9a4dc3b50f52f0fc62f": -150
Jan 1 02:15:45 flukso-c13f30 user.err fsync[1258]: fsync exit status: 7, synchronisation with Flukso server failed

I did not do a hard reset before re-saving the config from the web-interface to force a sync; but this also showed a "synchronisation with Flukso server failed"...
@Bart: May you have a look before I do a hard reset on both? (this account's FLM is re-powered; gebhardm2 is in the condition after the outage)
The same behavior applies to both my FLMs in the same manner - so they seem to not really like "forced power down" situations; maybe they don't like when the internet modem is not up fast enough...
Strange...
Best regards, Markus

icarus75 | Wed, 21/11/2012 - 22:09

If you look at your trace, the root cause is the time not being set correctly through NTP. That's why the syslog entries are timestamped with "Jan 1 ...". The funny thing is that during my investigation, the ntpclient seems to have done its job without any intervention from my side. (Note that system time is in UTC)

Nov 21 20:02:33 flukso-c13f30 cron.err crond[776]: time disparity of 22558792 minutes detected

I do believe the slow system clock syncing is probably due to your router/network not being fully up-and-running before the FLM makes its first NTP call.

Could you try to run another test by cold starting the router and FLM at the same time and verifying the system clock setting after boot?

gebhardm | Thu, 22/11/2012 - 07:52

Thanks, Bart. The 20:02 event was a cold start of the FLM. Will check your question when I get physically to the device again, thus probably this evening.

gebhardm | Sat, 01/12/2012 - 17:12

...the time disparity I also recognized on the simulated power outage... ref to other thread... (and close this one)