News feed

2023-08-30: Database upgrade

The MySQL database for LHC@home-dev has been upgraded to MySQL 8. This should not normally affect any of the functionality of the BOINC applications.

2022-11-09: Server Release 1.4.0

The server has been upgraded to the 1.4.0 tag for evaluation before the official release. Please let us know if there are any issues.

2021-08-20: CMS job queue to drain this weekend (21/08/2021)

CMS is about to release a new version of WMAgent based entirely on python 3. They have asked that they be able to update our agent by Monday evening (23/08), so I will not inject any new workflows before the upgrade. I expect the job queue to drain by late on Sunday.
Please set your CMS application to no new tasks by then.

2020-04-14: Server upgrade

The LHC@home dev server has been upgraded to server release 1.2.1. (In practice it was already running this code, but the project has now undergone the usual server upgrade process.)

2019-11-27: CMS@Home disruption this week

It appears that a database intervention at CERN went badly, leaving our data tables empty and us not being able to submit new CMS@Home jobs. Advice is that it will take several days to recover -- and as well as that some of the major players are in the USA, which has holidays for the rest of this week. I'll keep an eye on it, but I'm doubtful we'll be running again this week. Sorry 'bout that!
Happy Thanksgiving...

2019-11-11: CMS job shortage Wednesday 13th November

CMS IT will be installing a new version of WMAgent on Wednesday. This will impact job availability for the duration of the intervention. We might be able to eliminate the little gremlin that's been plaguing us for the last few weeks, too.
So, please set your CMS processors to No New Tasks sometime tomorrow, Tuesday 12th, so that current tasks will stop requesting new jobs before the queues get cut. I'll let you know when jobs are available again.
Thanks.

2019-09-06: Updated server code

We have updated the lhcathome-dev server code to the latest BOINC server release, 1.1.

Please let us know if you should spot any new bug or unexpected behaviour.

2019-07-17: CMS@Home: Disruption to our condor server next Monday

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5087#39376

2019-06-07: Using a local proxy to reduce network traffic for CMS

Thanks to computezrmle, with additional work from Laurence and a couple of CMS experts (and my adding one line to the site-local-config file), there is now a way to set up a local caching proxy to greatly reduce your network traffic. Each job instance that runs within a CMS BOINC task must retrieve a lot of set-up data from our database. This data doesn't change very often, so if you keep a local copy the job can access that rather than going over the network every time.
Instructions on how to do this are available at https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=475&postid=6396 or https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5052&postid=39072

2019-05-14: CMS -- Please set "no new tasks"

Hi to all CMS-ers. We need to drain the job queue so that a new version of the WMAgent can be installed.
Can you please set No New Tasks so that your current tasks can run out and no new jobs start? If you have any tasks waiting to run, please suspend or abort them.
Thanks, I'll let you know as soon as the change is done.

2019-04-18: Problem writing CMS job results; please avoid CMS tasks until we find the reason

Since some time last night CMS jobs appear to have problems writing results to CERN storage (DataBridge). It's not affecting BOINC tasks as far as I can see, they keep running and credit is given. However, Dashboard does see the jobs as failing, hence the large red areas on the job plots.
Until we find out where the problem lies, it's best to set No New Tasks or otherwise avoid CMS jobs. I'll let you know when things are back to normal again.

2019-03-23: CMS jobs

The batch I submitted last night is now showing on the monitor, so you can resume tasks at will.

2019-03-23: Warning: possible shortage of CMS jobs - set No New Tasks as a precaution

There was an intervention (i.e. upgrade) yesterday afternoon[1] on the cmsweb-testbed system we use to submit CMS workflows that left things a bit confused. One problem was fixed, and the monitor shows all good. However, we are running out of CMS jobs -- maybe 10 hours left -- but the new batch I submitted yesterday isn't showing up on the testbed monitor. I submitted another last night but still neither are being shown this morning, so I submitted yet another batch.
At the moment I don't know whether the submission has failed or whether the monitor hasn't picked up the new batches. As a precaution, set No New Tasks on your CMS project(s) to avoid tasks crashing due to lack of jobs. I'll let you know as soon as I'm sure jobs are available again.

[1] How many times do I have to tell people not to touch critical systems on a Friday -- especially Friday afternoon!?

2018-11-14: Dev server updated

We have updated our development server to the latest BOINC server release.

Please let us know if you spot any issues.

The server may go down for a while if we need iterations on this.

2018-02-27: ATLAS load tests

For information: We have increased the limits per host on the dev project as part of a campaign to test our new storage backend.

Expect most new ATLAS tasks to be pulled by our local cluster hosts and the LCG cluster in Beijing. But there are tasks for other applications too for those interested. There might be interruptions too, as this is our development and testbed. :-)

2018-01-23: LHCathome-dev server interruptions

There will be some interruptions on the dev project this week while we test a new BOINC version and migrate to another host.

2017-11-13: New logo for LHC@home?

We are planning to update our web presence for LHC@home and in this context we got a couple of proposals for a new LHC@home logo from our graphics team.

Continue to use the current logo: in extended and compact form.

Alternative new logo 1:

Alternative new logo 2:

Please vote for your preferred logo on this Doodle poll page.

Please note that this is work in progress, and that these images may be adjusted. Also if you have other proposals of your own, please do not hesitate to comment and display/link alternatives here in the forum. :-)

Many thanks in advance for your help and feedback!

..The LHC@home team

2017-10-11: Development project "face lift"

The development project has been upgraded to use the latest BOINC upstream code. Thus, it now has a new look based on Bootstrap in accordance with other projects like SETI@home.
Your feedback and suggestions are, as always, appreciated.

2017-10-06: Server Migration

Over the next week, some of the servers will be migrated in order to upgrade from SL6 to CC7. This will hopefully be transparent but if there are issues, this may be the reason.

2017-10-03: Server Upgrade

The server will be upgraded today to use the latest version of the Web interface. We anticipate that things may break so expect some instability over the next few days. What we discover by doing this will hopefully make the upgrade of the production server smoother.

2017-09-26: Reminder -- CMS jobs unavailable Weds 27th September

An upgrade to the CMS@Home workflow management system (WMAgent) is planned for tomorrow (Wed Sep 27th). This needs the current batch of jobs to be stopped so that the queue is empty. I plan to do this about 0700-0800 UTC on Wednesday.
To avoid "error while computing" task failures and the resulting back-off of your daily quotas, we suggest you set all your CMS machines to No New Tasks at least 12 hours beforehand to allow current tasks to time out in the normal way. You can stop BOINC once all your tasks are finished, if you wish.
Exactly how long the intervention will take is unclear, and there will be a delay of up to an hour to get a new batch of jobs queued afterwards. I will post here when jobs are available again, hopefully before the end of the day European time.

2017-09-12: Multicore Shutdown

A new feature has been implemented to shut down multicore VMs at the end of their life when the idle time > busy time. This means that if we are wasting more time with empty cores than we would lose by killing the remaining jobs, those jobs are killed. This should avoid the situation where a looping Theory job keeps the VM alive while wasting idle cores. If this works, the following message should be seen in the task output.

Multicore Shutdown: Idle > Busy ( 1439s > 1382s )
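
The rule can be sketched roughly as follows. This is only an illustration, not the actual agent code: the function names and the way the idle and busy totals are passed in are assumptions, and only the message format is taken from the example line.

```python
def should_shut_down(idle_seconds: int, busy_seconds: int) -> bool:
    """Return True when accumulated idle time exceeds accumulated busy time."""
    return idle_seconds > busy_seconds


def shutdown_message(idle_seconds: int, busy_seconds: int) -> str:
    """Format the notice in the same shape as the task output line."""
    return f"Multicore Shutdown: Idle > Busy ( {idle_seconds}s > {busy_seconds}s )"


# Values taken from the example message.
if should_shut_down(1439, 1382):
    print(shutdown_message(1439, 1382))
```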

Let me know how it goes.

2017-08-11: Server upgrade

We will migrate the lhcathome-dev project to a CentOS 7 server and upgrade the BOINC server components.

The lhcathome-dev project server will be unreachable for a while later today during the upgrade.

2017-06-26: CMS application job queue is being run down.

We want to update the WMAgent job controller, so I've stopped the next batch (I hope). We should run out of jobs in 10-12 hours, so set any machine running CMS tasks to No New Tasks as soon as practicable. Should be up again tomorrow.

2017-06-23: CGI testing

We are trying to improve the job limiter that is currently active, so you might experience some instabilities for the rest of the day.

Thank you for your understanding.

2017-06-06: Dev server downtime Tue 6 June

We need to rebuild the dev VM as it is running out of memory. Thus the LHC@home-dev project server will be unreachable for part of the day today.

2017-05-23: Dev server was down today

We had an issue with our dev server and had to rebuild it today. Should be back in business now. Sorry for the trouble.

2017-04-25: Screenshot display

A new feature has been added. The screenshot that is captured in the event of an error during the job execution is now displayed at the very bottom of the result page. So, if a task errors out, the state of the VM at the time of the error can be seen in the task's results.

We would greatly appreciate any feedback on this new feature.

Thank you for your invaluable support!

2017-02-23: New native Linux ATLAS application

Hi all,

If you don't use Linux you can ignore the rest of this post. If you do you may be interested in trying the experimental ATLAS app which doesn't use virtualbox but runs natively on Linux.

IMPORTANT!! To run this app you must install CVMFS, the CERN VM File System, and configure it for ATLAS. This file system contains all the software for ATLAS WU and is normally inside the virtual image (the same as for all LHC vbox apps).

A simple installation guide can be found here: https://cernvm.cern.ch/portal/filesystem/quickstart

You should set up the repositories as shown in the example for ATLAS. If you have a squid proxy handy you can specify it there - if not I'm not sure whether it will work or not without configuring one.

Our target for this app is CERN or ATLAS-related institutes who have idle machines with CVMFS already installed, and we do not expect the average volunteer to install CVMFS and run this app. But I think all of you here are above-average volunteers :) and you may be interested in trying it.

Please give feedback on the ATLAS forums. Unfortunately there is no way to check for CVMFS on the client before requesting tasks, so if you don't have CVMFS you can still get these tasks and they will fail straight away. So better to uncheck the ATLAS app if you don't want to run it.

2017-02-17: Draining the CMS job queue

Because of an upgrade to the WMAgent server, we need to drain the CMS job queue. So, I'm not submitting any more batches at present and we should start running out over the weekend. If you see that you are not getting any CMS jobs (not tasks...) please set No New Jobs or stop BOINC.
I expect that the intervention will take place Monday morning, and hopefully we'll have new jobs again later that day.

2017-01-27: Good news for the CMS@Home application

We've now demonstrated that we can perform all the steps required to bring CMS@Home into a production tool for CMS Monte-Carlo data production. Please see this announcement at LHC@Home for more details.

2017-01-17: SSL on lhcathomedev.cern.ch

We now have a proper certificate for our dev server, and are taking the opportunity to change the project name to LHCathome-dev. Please use this URL from now on for the LHC@home-dev project:

https://lhcathomedev.cern.ch/lhcathome-dev

The former test SSL URL to this project on the production server will be stopped and redirected here.

Thanks for your continued support and help!

2016-10-24: CMS Servers up again

https://www.neowin.net/news/dirty-cow-flaw-lets-hackers-gain-control-of-linux-systems-every-single-time

YEP Linux is just the greatest and most secure OS ever ?


.....I didn't do it.......and I never liked a Dirty Cow

(OK I won't restart the OS war)

2016-10-24: CMS Servers shut down until further notice

I have just been informed (and confirmed) that the servers we use at RAL have been shut down due to the CVE-2016-5195 "Dirty COW" Linux kernel vulnerability. They will remain down until a patch is available and applied.
It would be prudent to suspend the CMS application in BOINC until the servers are restarted.

2016-10-05: Server Consolidation

As mentioned previously, we would like to consolidate the existing production servers (Sixtrack, vLHC and ATLAS) into a single service. We hope that by doing this we can improve the support and reduce the confusion. One benefit for all is that there will be a single forum so both us and our volunteer moderators can be more effective.

The transition will have two phases, commissioning and decommissioning. First a new server will be prepared with a similar configuration as this dev project but based on the Sixtrack DB. This is because they have the most users and 50% of the active users from vLHC and ATLAS are already there, hence it should minimize the impact. Once this new server is ready, it will be opened up for use in parallel with the existing three servers.

Next comes the decommissioning. For Sixtrack this should be straightforward: the URLs for the old host will be redirected to the new host. For vLHC and ATLAS things will be a little more complicated. Those users who are already registered with Sixtrack will be encouraged to move to the new server. Those who are not registered can either register themselves and move, or we can do a bulk registration. Tasks can then be stopped and the URLs redirected.

Finally there is the issue of credit. It should be possible to migrate the credit from the old servers to the new server. This can only really be done once the old servers are no longer used. There is no time-critical aspect to this; until it is done, only the new credit will be visible.

Comments and feedback on this proposal are welcome.

P.S. The dev project will stay around as it is.

2016-09-16: Migration to SSL

The scheduler and web pages of this dev project are now also published on the URL:

https://lhcathome.cern.ch/vLHCathome-dev

Please detach and re-attach to this new URL with your BOINC clients. The old project server is still running, and also the file upload and validation daemons run on the old server for now.

Later on after a test period, we will redirect the old URL. Then we will proceed in a similar way with our production project.

2016-08-03: Task Tracker

I have added a task tracker to the top left of the page so everyone can see what issues we know about, which ones are being worked on and what is being done right now. It still needs populating with a few items.

2016-07-31: Task and CPU limiter

The server has just been updated to add the feature that limits Tasks and CPUs per user. This limit can be controlled in the project preferences.

Together with my changes to the scheduler, per-project limits on jobs in progress and #CPUs should now work. But I haven't actually tested this. Laurence, please try it and tell me if it doesn't work.
-- David


Please post any feedback in this thread.

2016-07-28: Server Upgrade Tomorrow Morning

The server will be upgraded tomorrow morning.

2016-07-28: Update

David Anderson, Rom Walton and I had a conference call yesterday where we discussed limiting tasks per user, why we want to do this, multi-core VMs and the VT-x issues.

One of the reasons why we would like to limit tasks is that machines can be assigned more tasks than they can handle, or more than is desired, which leads to problems. David pointed out that BOINC should respect the resource constraints and, if it does not, the issue needs to be looked into. Feedback on this would be welcome, so I have created a new thread where you can paste any scenarios where BOINC is not respecting the constraints.

Implementing a task limit per user should be straightforward. I will provide an updated php file for the project preferences and David will update the scheduler code to take this into consideration.

Similarly for multi-core, we can set a flag in the project preferences for whether or not you would like BOINC to provide multi-core VMs. This is an area where we probably still need to experiment.

Finally, the VT-x issue was discussed as over 30% of our failed tasks are VMs that fail to boot due to this setting not being available or enabled. It was pointed out that tasks should not be provided if the machine is not capable of running them. This will also be investigated.

2016-06-20: Scheduler and vbox update to detect 64-bit enabled computers

The BOINC scheduler has been updated to detect 64 bit machines that do not have the virtualization hardware extensions enabled. Also vboxwrapper has been updated for better error handling, and vboxwrapper 26193 is now deployed for Windows and Linux for the Theory application.

Many thanks to RomW for providing these changes!

2016-06-05: CMS Jobs Available Again

There was a fault with a server at CERN last night, which meant that we could not submit new CMS jobs, so we ran out.
However, the problem has now been fixed and CMS jobs are available again. Many thanks to the staff who worked Saturday night and Sunday to fix the problem.

2016-05-24: Infrastructure Update

The authentication server used to get the proxy has been changed. New tasks from now on will use the new server. This change should be transparent but in case everything breaks in the next few hours, this will be why.

2016-05-18: CERN Bulletin

This article was published in the CERN Bulletin yesterday.

http://cds.cern.ch/journal/CERNBulletin/2016/20/News%20Articles/2151943?ln=en

2016-05-03: Project Configuration Update 2

Some project configuration parameters have been changed to help avoid hosts being swamped with tasks and to back off problem hosts.

<daily_result_quota> 2 </daily_result_quota>
<max_wus_in_progress> 1 </max_wus_in_progress>
<max_wus_to_send> 1 </max_wus_to_send>
<min_sendwork_interval> 60 </min_sendwork_interval>

Note that the values are multiplied by the number of cores and we estimate tasks lasting for at least 12 hours.
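
As a rough illustration of the note above, the effective per-host limits can be worked out as below. This is a sketch under assumptions, not scheduler code: the function name is hypothetical, and it assumes that only the three count-type options scale with the core count (min_sendwork_interval is a time interval and is left out).

```python
# Count-type options from the configuration above.
PER_CORE_LIMITS = {
    "daily_result_quota": 2,
    "max_wus_in_progress": 1,
    "max_wus_to_send": 1,
}


def effective_limits(n_cores: int) -> dict:
    """Scale each per-core limit by the number of cores on the host."""
    return {name: value * n_cores for name, value in PER_CORE_LIMITS.items()}


# Example: a 4-core host.
print(effective_limits(4))
```

Under these assumptions a 4-core host could receive up to 8 results per day with 4 in progress, which matches two tasks of at least 12 hours per core per day.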

Ref: https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Please post if this causes any problems for anyone.

2016-04-30: Error Codes

At the beginning of next week work will start on providing consistent error codes and behaviour for all applications. Three error codes will be used:
EXIT_INIT_FAILURE (when an error is detected on contextualizing the VM or setting up the job environment)
EXIT_NO_JOBS (when the job queues are empty)
EXIT_JOB_FAILURE (when an error is detected that caused all jobs to fail or all jobs have failed)

Any of these errors should cause the BOINC client to back-off. If you see any errors and one of these codes is not used, please let us know.

EDIT: To get this into the upstream release the codes have changed to: EXIT_NO_SUB_TASKS and EXIT_TASK_FAILURE.

2016-04-22: Refactoring

The bootstrapping code used to prepare the environment for the jobs in the VM has been re-factored so that common tasks for the five applications are abstracted to common functions. It has been tested in a VM but there may be issues relating to the diversity of our environment and error conditions.

2016-04-21: New Applications

As many of you have already seen, there are now five applications at various levels of readiness hosted in this project. The challenge now is to bring them to the level of quality required for the production project. This will involve work both on the frontends that are visible and the backends that are not. With more applications, we have to be a little more focused and use our efforts effectively. Recently there has been some good communication from everyone on the message boards and I hope that this will continue. The initial focus will be to improve the Theory application as it should be the easiest and, as many components are now shared between the applications, the improvements should benefit all. We would like to thank everyone for their continued support; it really does help to make a big difference.

2016-04-18: Project Configuration Update

As we now have multiple applications, some of you have requested that we remove the restriction that limits the number of running tasks per host to one. This restriction was put in place so as not to overload machines while developing. As everyone in this development project is (or should be) an advanced user, we assume that everyone knows how to adjust their preferences to limit tasks on the client if needed. This update will be done tomorrow morning at around 10am CEST (8am GMT, Unix time 1461052800).
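
For anyone wanting to check the timestamp, here is a small sketch using only the Python standard library. The fixed UTC+2 offset for Geneva local time on that date is an assumption made to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

UPDATE_EPOCH = 1461052800  # Unix timestamp quoted in the announcement

utc_time = datetime.fromtimestamp(UPDATE_EPOCH, tz=timezone.utc)
# Assumption: Geneva local time on this date is UTC+2 (summer time).
geneva_time = utc_time.astimezone(timezone(timedelta(hours=2)))

print(utc_time.isoformat())
print(geneva_time.isoformat())
```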

2016-04-11: New Theory Application

A new Theory application has just been added. If you do not wish to receive these tasks, please update your vLHCathome-dev preferences.

2016-03-16: Server code update

We're down for a short while for a server code update.

2016-03-08: Change of project name

As mentioned earlier (under "Project Restructuring"), CMS-dev has evolved into a more general dev project for virtual machine applications running under LHC@home.

We will therefore rename the "CMS-dev" project to "vLHCathome-dev" as it is now a development project for early testing of applications that potentially could run in production under the Virtual LHC@home platform.

The change is planned for tomorrow at 13:00 UTC, and you should later see this project as:
http://lhcathomedev.cern.ch/vLHCathome-dev/

Redirection will be put in place, so in principle BOINC clients should be able to follow. Otherwise please detach the project and re-attach to the new URL at your convenience.

Thanks to your contributions and feedback on tests of our applications, CMS-dev has been a success, and we would like to express our warm thanks for your contributions! :-)

Please note that this remains a development and test project, that might provide unstable applications and that any BOINC credit accumulated here might get lost.

If you prefer to just crunch and get credit, please give priority to our production LHC@home projects.

Many thanks for your collaboration!

... the team

2016-03-07: LHCb Jobs

Tasks for the LHCb application will start to be submitted. By default this has been made opt-in so you should only get tasks if you specifically ask for them. The application is still in development so only give it a try if you're curious.

2016-03-04: Updated Job Agent

The CMS job agent has been updated to add some additional protections. The VM will now shut down if there are no more jobs, if no output has been produced or if too many jobs fail.

2016-03-03: New CMS App v46.26

A new CMS application (v46.26) has been released. It provides two improvements. The first directs the CVMFS traffic to a dedicated squid proxy. By monitoring this closely we can potentially identify where it would be advantageous to place other squid caches. The second updates the CVMFS configuration so that the need to do a reload can be avoided in the future and hence the boot time will be reduced.

2016-03-03: LHCb Application

An initial alpha version of the LHCb application has been added. There are no tasks for now. Two new topics have been created for the message boards so that discussions on the CMS application and LHCb application can be kept separate.

2016-03-03: We may have made a mistake...

If you get error messages about site-local-config like the ones in this message, please abort your current task and start again. We made a change yesterday afternoon to speed up booting, but things aren't behaving as we'd expected. We've reverted to the previous version, but you need to reboot the VM to pick up the changes (...or let it spew out job failures for 24 hours...). :-(
Sorry, it was my idea but things obviously weren't working exactly as I'd thought.

2016-03-02: Change Log

This thread will be used to provide information on all the changes that are made to help correlate issues with potential causes. It is needed as not all changes are tied to a new application release such as with the supporting infrastructure.

2016-03-01: Project Restructuring

As discussed in other threads, we would now like CMS-dev to become our general testing project. In order for this to happen the project should support sub-projects similar to PrimeGrid and per app credit. We are aiming to add these features within the next few days.

2016-02-26: Out Of Jobs

We are out of jobs and are fighting with a few other issues. I have stopped new tasks being sent for now. A feature to handle this situation more gracefully is on the work plan. I hope that we can be back running after the weekend.

2016-02-19: Infrastructure Issues

There is an issue with one of our servers that will stop new glideins (runs) from working. In theory the VMs should just idle until this is fixed.

2016-02-16: Suspend/Resume

The suspend/resume issue should now be resolved, so it should be possible to pause/save a VM for up to 48 hours without losing the current job. This will only work with new tasks so please start a new one if you would like to test this. As usual, please post any message to this thread if you find any problems.


EDIT: Just as a reminder, the job would be evicted after the VM had been suspended for about 20 minutes. If you do a test and it is fine, please also post and say how long the VM was suspended.

2016-02-10: New App Version For Linux and Windows

A new app version (CMS v46.23) for Linux and Windows has been provided. It contains vboxwrapper v26183, which provides a heartbeat mechanism that can detect if the VM fails to boot or freezes. It should prevent VMs from just sitting idle in scenarios such as a kernel panic at boot. A Mac version will be made available once a build is available. As usual, please let us know if there are any issues with this release.

2016-02-09: Workplan

This thread will be used to provide information on the status of issues/improvements. Strikethrough will be used to show items done (check for issues), bold for in progress and italics for things on the to-do list.

2016-02-09: Zombie Tasks

The issue where many tasks have been sent to the BOINC client but only one runs is being investigated. The number of tasks per day per user has been reduced from 500 to 20 in an attempt to reduce the impact. This is probably why only a few jobs are running as the clients are now blocked from getting a new task that may run. A project reset might fix this or maybe not. If anyone has any information that may help, please post.

2016-02-05: Server Restart

The server has been restarted to hopefully address the issue of multiple tasks being sent to the BOINC client.

2016-02-03: Aborted VMs

It has been noticed that some VMs are aborting. This may be due to them running out of memory as no swap space has been configured. The bootstrap script has been updated so that next time the VM is started some swap will be configured.

2016-02-03: New plot for the jobs stats

A new plot has been added to the CMS job stats page. It shows the wallclock consumption for successful and failed jobs. This is a better indicator than using the number of jobs as it is not affected by job length.

2016-02-02: Updated Agent

The job agent has been updated with the main aim to get logging messages back into stderr_txt so it can be seen for each task. The following changes were made:
Re-factored to get the BOINC information from init_data.xml rather than fd0
Re-factored logging to use a common logging function
Use the vboxmonitor command to also send the messages to the VBox.log
Shutdown rather than sleep on errors
Report shutdown reasons using the completion trigger file

This update provides a handle for shutting down the VM when problems are detected and providing the reason to the BOINC client. Details are also archived for the task which should help troubleshooting and improve overall support.

2016-01-31: Graceful Shutdown Now Implemented

The graceful shutdown of VMs has now been implemented. When the VM is older than 24 hours, after the current run has finished the VM will shut itself down using the completion_trigger_file method. More precisely, a file is placed in a shared directory between the host and guest that signals to the BOINC client that the task has ended. To verify that the VM was shut down gracefully, the message VM Completion File Detected should be seen in the stderr_txt of the task. This required a new app version to be released (v46.22) that contains the following changes to the job description:
Set the completion trigger file to be shutdown
Enabled the shared directory
Increased the job duration to 36 hours (to stop the BOINC client from shutting the VM down, while still keeping a limit in place for protection)
Copied the init_data.xml to the shared directory (to support later improvements)
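
The guest-side logic can be sketched as follows. This is a minimal illustration under assumptions, not the real bootstrap code: the function names are hypothetical, the shared directory is passed in rather than hard-coded, and the trigger file is simply created empty. Only the 24-hour limit and the file name shutdown come from the description above.

```python
import tempfile
import time
from pathlib import Path

MAX_VM_AGE_SECONDS = 24 * 60 * 60  # the 24-hour limit from the announcement


def vm_is_due_for_shutdown(boot_time: float, now: float) -> bool:
    """True once the VM has been up for more than 24 hours."""
    return now - boot_time > MAX_VM_AGE_SECONDS


def write_completion_trigger(shared_dir: Path, name: str = "shutdown") -> Path:
    """Create the (empty) trigger file in the host/guest shared directory."""
    trigger = shared_dir / name
    trigger.touch()
    return trigger


# Demonstration in a temporary directory standing in for the shared directory.
with tempfile.TemporaryDirectory() as tmp:
    booted = time.time() - 25 * 60 * 60  # pretend the VM booted 25 hours ago
    if vm_is_due_for_shutdown(booted, time.time()):
        print(write_completion_trigger(Path(tmp)).name)
```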

2016-01-30: Poll

As discussed in a recent thread, there will potentially be 6 LHC related applications (Six Track, Test4Theory, ALICE, ATLAS, CMS and LHCb) and hence between 1 and 12 projects depending on how things are organised. The options are:
1. One project with beta apps
2. Two projects; prod and dev
3. One project and six dev projects
4. Six prod projects and six dev projects

What would you prefer?

http://doodle.com/poll/esktqvrikqmpmyp2

2016-01-29: Constructive suggestions please

As mentioned elsethread, I have to prepare a summary of required/desired improvements to CMS@Home to take it up to production readiness. Please post suggestions and criticisms in this thread. Please keep it short and non-personal, as I'll have to de-serialise the thread to make my report.

2016-01-27: Migrating to vLHC@home

As most of you already know, the aim of this project was to get the CMS application to a point where it was mature enough to be added to the vLHC@home project as a beta app. We believe that we have now reached that point and would like to migrate our activity to that project.

The CMS beta application, which should be identical to this one, is now available in vLHC@home. Out of the 190 volunteers that have credit, 89% already have a vLHC@home account. Please could everyone who is running here try out the CMS beta application from vLHC@home. To do this you will need to go to the vLHCathome preferences and enable CMS Simulation.

http://lhcathome2.cern.ch/vLHCathome/prefs.php?subset=project

If the beta app is working for you, please stop running here. Once most have migrated, no new tasks will be created and the accumulated credit can be migrated from here to vLHC@home.

Please post any comments or issues relating to the migration in this thread.

Thanks to everyone who has supported this project and enabled us to get to where we are today.

Laurence

2016-01-03: Important information on upload bandwidth

We've been puzzling for a while as to why we were getting a lot of "stage-out" failures -- i.e. problems returning result files to data storage.
I'd pushed up lately to CMS jobs taking 2-5 hours, depending on processor speed, and returning ~150 MB result files. This means that on average each VM is returning ~50 MB/hr (or to put it another way, at an upload speed of 1 Mbps, returning a result file would take 1500 seconds, or 25 minutes).
It seems technology is roughly consistent across the world, and many consumers are still on ADSL broadband -- where the A means Asymmetric, that is upload speed is usually much slower than download speed. Upload speeds around, or even less than, 1 Mbps seem to be the norm for ADSL broadband.
So, the problems started occurring when enthusiastic volunteers started running several machines at once on their home networks. This meant that the total load on the upstream channel exceeded availability, uploads stalled and we started getting transfer time-outs.
So, the caution to take away from this is to make sure you know your upload speed, and make sure you don't run so many machines that they take your line into saturation.
I believe there are some workloads we could commission with a somewhat smaller MB/hr result generation; I'll let you know if we can start running them.
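
If you want to check the arithmetic for your own line, here is a small Python sketch. It uses the figures from this post (~150 MB result files, ~50 MB/hr per VM). The function names and the bits_per_byte parameter are my own additions. At a raw 8 bits per byte a 150 MB file takes 1200 seconds at 1 Mbps; the 1500 seconds quoted above corresponds to about 10 bits on the wire per payload byte, which I am assuming is an allowance for protocol overhead.

```python
def upload_seconds(size_mb, uplink_mbps, bits_per_byte=8):
    """Time to upload size_mb megabytes over an uplink of uplink_mbps megabits/s."""
    return size_mb * bits_per_byte / uplink_mbps


def max_machines(uplink_mbps, mb_per_hour_per_vm=50, bits_per_byte=8):
    """Upper bound on VMs whose average upload rate fits within the uplink.

    Compares megabits available per hour with megabits produced per VM per hour.
    """
    megabits_per_hour_available = uplink_mbps * 3600
    megabits_per_hour_per_vm = mb_per_hour_per_vm * bits_per_byte
    return int(megabits_per_hour_available // megabits_per_hour_per_vm)


if __name__ == "__main__":
    print(upload_seconds(150, 1))                    # raw line rate
    print(upload_seconds(150, 1, bits_per_byte=10))  # with overhead allowance
    print(max_machines(1))
```

Treat the result of max_machines as a ceiling, not a target. It assumes the uploads are spread perfectly evenly and that nothing else is using the upstream channel, so in practice the line will saturate with fewer machines than that.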

2015-10-29: Updated Agent

The CMS Job Agent used in the VM has been updated. It contains the following changes:

* Fixed 1 hour (not) sleep issue
* Support for non-BOINC instantiations
* Added support for running under vLHC@home

This update should be transparent but if not please let us know by posting a reply to this message.

2015-10-22: VirtualBox wrappers upgraded to 26178

The VirtualBox wrappers for Windows, Linux and Mac have been upgraded to 26178.

It contains the following fixes:
* VBOX: Add code to handle search path modification for Linux and Mac.
* VBOX: On a hypervisor detection failure dump all the logs to stderr, it would have quickly exposed a search path change on Mac OS X.
* VBOX: Reduce the amount of disk I/O when parsing the VM log file
* VBOX: Fix a regression introduced in 26172 with starting up a VM

Let us know how it goes.

2015-10-20: New jobs available

I've now submitted a larger batch of jobs since the failure rate seems manageable. There were a few host IP addresses recurring amongst the failures; I'll keep an eye out for them in future and contact the owners if they continue to misbehave. You can start running tasks again now if you wish.

2015-10-13: New vboxwrapper

We've released new versions of CMS@Home with the latest vboxwrapper 26175. See the discussion in Number Crunching for the effects this has.

2015-10-13: New developments

We're at the stage where we have to make disruptive changes to the workflow, in order to get the results onto the Grid from the data-bridge. At some point soon we'll start getting errors for jobs in the current batch, at which time I'll ditch the rest and submit a small test batch. If we're lucky that may be the end of it, we'll have to see.
Thanks in advance for your understanding.

2015-10-12: CMS beta in Virtual LHC@Home

Some tasks are now being run through vLHC@Home -- see this thread for details.

2015-10-09: Possible short outage...

I'm about to try manually to install a new certificate proxy, as the default 7-day initial proxy is about to expire. This is the first time we've done this, so it may not work -- if it fails, expect to see job failures. I'm not sure if the jobs will fail before they get to your tasks or after... If I see failures I'll submit a new batch immediately, so don't panic if you see failures, we should ride it out OK.

2015-09-29: Logo

Shall we make a logo to sit up there in the top left-hand corner?
Here's my suggestion:

2015-08-31: Jobs incoming!

Patches have been applied, jobs should be ready when you want them. Enjoy!

[Edit] Confirmed, jobs are available now. [/Edit]

2015-08-28: Progress!

We are making great progress and are just chasing up the last remaining issues. One of the recent improvements was to create the link to the CMS monitoring infrastructure. This means that we can generate nice plots similar to what ATLAS@home have for their project.

2015-08-25: Some jobs again

We can now submit jobs again. I'll submit a test batch overnight, and then try for a bigger test for the rest of the week. Feel free to start running tasks again, and report problems (and successes...) in the usual places.
Thanks.

2015-08-22: No new jobs

We've run out of jobs on the Condor server. Until I can sort out the glitch that's preventing me submitting new jobs you can all take a rest for the weekend, or switch to backup projects.
Cheers, ivan

2015-08-20: Agent Fixed

The agent is fixed. If you experience any problems, please abort the old task first to verify the issue in a new task and then post a reply in this thread.

2015-08-19: Agent Broken

We have an issue with the agent so the VMs will not get new jobs until this has been resolved.

2015-08-19: New CMS Agent

We are just about to push a new CMS Agent to CVMFS. The code has been re-factored to be much simpler, less code = less bugs :)

It should appear in a few hours, let us know if there are any problems.

2015-08-17: Helpful tips for new users

This thread is to collect useful information in one place. Feel free to add your tips here.
================
A bug has been found in the Windows version of BOINC which means that files larger than 4 GiB (2^32 bytes) are being left behind in slot directories, affecting us and other BOINC projects. Unfortunately, we do produce VM files that large, so we are interfering with these other projects. If you are active on this project, using Windows, please update your BOINC version to a patched version (see this message and thread).

"The files that are needed to apply the hotfix are

For 64-bit BOINC
boinc.080515.x64.zip

For 32-bit BOINC
boinc.080515.x86.zip

Simply extract the two files for your version from the .zip archive, and copy them to your BOINC program folder - you'll need to stop the BOINC client while you do this, and restart it again afterwards."
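
To check whether you have any of these oversized files left behind, something like the following Python sketch can help. It walks a directory tree and lists files larger than 4 GiB (2^32 bytes). The function name is made up for this example, and you need to supply the path to the slots directory inside your own BOINC data folder, since its location depends on your installation.

```python
import os

FOUR_GIB = 2 ** 32  # 4 GiB in bytes, the size limit mentioned above


def find_large_files(root, threshold=FOUR_GIB):
    """Return a list of (path, size_in_bytes) for files larger than threshold."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if size > threshold:
                hits.append((path, size))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_large.py <path-to-BOINC-slots-directory>
    for path, size in find_large_files(sys.argv[1]):
        print(f"{size / 2**30:.2f} GiB  {path}")
```

This only reports files; it does not delete anything. Stop the BOINC client and make sure no task is using a slot before removing anything by hand.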

2015-08-15: We have results!

First CMS@Home results returned to storage!



(Don't try to look at the page yourself, unless you have CMS credentials.)

2015-08-04: Agent Update

We will shortly be updating the agent that is running in the virtual machine. This should start using the new infrastructure that we have been working on recently. It may not work first time so if you have any feedback, please respond to this post.

2015-07-21: Real CMS Jobs

Over the past few months we have been re-engineering our internal infrastructure so that we can send real CMS simulation jobs to the CMS@home project and are nearly ready to try this out for real. From the VM side of things, we only need to update the CMSJobAgent.py script which will be done via the magic of CVMFS [1], so no new application release will be needed.

Although this should be transparent, there is a high chance we will temporarily break something, so please bear with us during this potential period of instability. I would estimate that we will be ready within the next two weeks and I will send an announcement before we make any changes.

Many thanks to all of you who have been helping with the testing of this project.

[1] http://iopscience.iop.org/1742-6596/219/4/042003

2015-05-19: Urgent Update for Windows Users

A bug has been found in the Windows version of BOINC which means that files larger than 4 GiB (2^32 bytes) are being left behind in slot directories, affecting us and other BOINC projects. Unfortunately, we do produce VM files that large, so we are interfering with these other projects. If you are active on this project, using Windows, please update your BOINC version to a patched version (see this message and thread).

"The files that are needed to apply the hotfix are

For 64-bit BOINC
boinc.080515.x64.zip

For 32-bit BOINC
boinc.080515.x86.zip

Simply extract the two files for your version from the .zip archive, and copy them to your BOINC program folder - you'll need to stop the BOINC client while you do this, and restart it again afterwards."

Thanks to the community for helping us debug this, especially Richard, Crystal Pellet and Ray, and to the BOINC crew for coming up with the fix.

In the meantime, we will move on to debugging why our VMs are currently growing so large.

2015-04-11: VBox Wrappers Updated to 26165

The VirtualBox wrappers for Windows, Linux and Mac have been upgraded to 26165.

It contains the following fixes:
* VBOX: Add VboxStartup.log to the list of partial log dumps to stderr when something goes wrong.
* VBOX: Remove unneeded files.

In addition the plan class vbox64 has been specified for the platform and the parameter has been added to the version.xml.

Let us know how it goes.

2015-04-08: VBox Wrappers Updated to 26164

The VirtualBox wrappers for Windows, Linux and Mac have been upgraded to 26164.

It contains the following fixes:
* VBOX: Check for a valid Console pointer before attempting to pause/resume the VM.
* VBOX: cut down on some of the noise with spurious 'Status Report' messages when we attempt to launch the VM.
* VBOX: After adding in a new VirtualBox COM interface, you must hook up the plumbing.
* VBOX: Only add the guest additions ISO to the VM if the file has actually been detected on the file system.
* VBOX: Add COM support for VirtualBox 5.0.

Let us know how it goes.

2015-03-28: VBox Wrappers Updated to 26160

The VirtualBox wrappers for Windows, Linux and Mac have been upgraded to 26160.

It contains the following fix:
* VBOX: If polling for the current VM state fails for any reason, like vboxsvc crashing, do a temp exit for 24 hours.


Let us know how it goes.

2015-03-27: Windows VBox Wrapper Updated to 26159

The VBox wrapper for Windows has been upgraded to version 26159. It contains these fixes:
* VBOX: Add better error checks when handling COM error conditions.
* VBOX: Add additional check for a valid pointer to prevent crash condition.

Please let us know if you have any problems.

2015-03-26: VBox Wrappers Updated to 26158

Hi, thanks for waiting. Unfortunately I got side-tracked for a while due to CERN's changing the rules since the last time I renewed my contract...
I've changed the vboxwrapper to V.26158, so you can resume tasks if you want. It seems to run fine on Windows and Linux. Please run a test WU, esp. if you're on a Mac. (Maybe I can leverage this project to get a Mac myself!)
I've added the new feature to the vbox_job.xml file which will attempt to use 'savestate' instead of 'poweroff' when gracefully shutting down.
I haven't added another new feature, which will prevent snapshots from being created by vboxwrapper, because it wasn't immediately obvious to me how the volunteer overrides this, or sets his own checkpoint schedule -- no doubt this will be clarified to me in the very near future. I'm wary of just disabling checkpointing at this stage.

2015-03-26: Please set No New Tasks for a short while

Gruezi Mittenand;
I'm about to make my first solo release (to update the vboxwrappers). It's inevitable that I'll make mistakes, so please set NNT until I give the word to go, to avoid picking up some intermediate flawed state.
Thanks.

2015-03-22: VBox Wrappers Updated to 26157

The VirtualBox wrappers for Windows, Linux and Mac have been upgraded to 26157. Also the tag enable_cern_dataformat has been removed from the job XML file.

Let us know how it goes.

2015-03-21: Mac VBox Wrapper Updated to 26156

I have just updated the VBox wrapper for Mac to version 26156. Please let us know if you have any problems.

2015-03-20: A Message To All Our Volunteers

First of all I would like to take this opportunity to thank all our volunteers, especially those who are active on the message boards and are helping us to evolve this project. Without volunteers we cannot do volunteer computing.

The goal of this project is to develop what is required so that the CMS collaboration can use this resource for computationally intensive tasks such as producing Monte Carlo events, that is, simulated collisions within the detector. Due to the complex nature of the software and the difficulty of maintaining up-to-date ports on different platforms, the virtualized approach is being used. This means that the BOINC tasks are just virtual machines that run for 24 hours. When the virtual machines start, they download the real computational task from our own infrastructure. For now these tasks are just many copies of the same example so please don't dedicate too many resources as the results will not be used. Your computing cycles are a valuable resource and there are many other projects that could benefit from them. We still have quite some work to do on our back-end infrastructure so that the collaboration can seamlessly direct tasks here and receive back the results.

Our vision is that in the near future, when the application and back-end infrastructure are mature, we can include it in the vLHC project as another application. When this happens it will not be possible to transfer the credit so please add your resources there now if that would annoy you.

Once again thank you for your participation and helping us to get this off the ground.

2015-03-20: Vbox Wrapper Updates

The VirtualBox wrappers for Windows and Linux have been upgraded to 26156. The Mac wrapper has been downgraded to 26105.

Let us know if you have any problems. I will post another general news item soon providing more details about this project.

2015-03-20: VBox wrapper problems

After upgrading to the version 26155 of the VBox wrappers, we have experienced some problems. Rather than reverting back to a working state we are going to push forwards and help debug them. We hope that this way our development project can then help those in production.


Cheers,

Laurence

2015-03-18: New Release (v46)

The vboxwrappers have been upgraded to 26155 so that we are now in sync with vLHC@Home.

Please let us know how things are going by posting on the message boards.

Thanks,

Laurence

2015-03-13: New Release (v45)

A new app version has been released (v45). This is a FAT image (548MB compressed) that contains many of the files that were previously downloaded via CVMFS. It should reduce the amount of network traffic and hence make everything a little bit more efficient.

Please let us know how things are going by posting on the message boards.

Thanks,

Laurence

2015-03-12: New Release (v44)

A new app version has been released (v44). The main fix addresses an issue whereby jobs were failing on machines due to configuration files in CVMFS not being discovered automatically. Various other minor things have been cleaned up. The image is now small (15MB) but will download about 1GB of files via CVMFS. This may be changed in a future release where we will increase the size of the download image by already including many of the needed files.

Please let us know how things are going by posting on the message boards.

Thanks,

Laurence

2015-02-25: Another new image and access to the log files

Once again we have updated the VM images, so it would be nice if you could get a new VM.

This time we have done multiple things:
1) Credit problem
We are hoping to address the credit problem in this release, but we don't know if our modifications will do the job... So please report back how it's going for you.

2) We have implemented a web server on the VM, so you should now be able to press the "show graphics" button for your job in the BOINC Manager. When your web browser opens, you should see the sample page of t4t. Don't think about it too much; those are just some sample images that are included in the t4t-webapp package. For now we (as the current developers) don't have the knowledge to produce such images from the CMS framework, so this will be done later by people from CMS.
However you can look at the logs, that are produced by the CMSJobAgent (which fetches the jobs) and the cmsRun (the actual CMS program). Just click on the Logs button and you will be there.


Some questions about the logs already arose internally (thanks to Ben) so some short comments on that:

1) As you might notice, we have two versions of each log. One is produced by tail, one by dumbq-logcat. Personally I liked how dumbq is able to timestamp the output (we use it for the consoles as well), but it seems to have some difficulties when being directed to a file, so you will notice that the log stops at random places, but then continues from there as it gets new input.
I still haven't figured out why that is...
So as a conclusion, you might want to look at the logs which have "tail" in their name.

2) stderr and stdout seem to be swapped sometimes
The reason for this is that our server does not have a valid certificate, so wget ends up dumping its log to stderr.

You should find this in your logs:
Connecting to data-bridge-test.cern.ch|128.142.154.228|:443... connected.
WARNING: cannot verify data-bridge-test.cern.ch’s certificate, issued by ‘/C=--/ST=SomeState/L=SomeCity/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=data-bridge-test/emailAddress=root@data-bridge-test’:
Self-signed certificate encountered.
WARNING: certificate common name ‘data-bridge-test’ doesn't match requested host name ‘data-bridge-test.cern.ch’.

2015-02-20: Server failure over the night

Overnight we experienced a failure in our back-end server, which feeds jobs into the VMs.
So it might have been that your VM was sitting around doing nothing at that time...
We have now restarted the server and everything is back to normal.
Your VMs should be receiving jobs again :)

Unfortunately at the moment we do not completely understand what caused the server to crash, but we hope to figure that out soon!

EDIT: In case your VM is still not running properly, getting a fresh VM by getting a new BOINC job/WU should solve this.

2015-02-19: New VM image and new console feature!

We have just updated the VM image. So please abort your running jobs/WUs and get a new one.
The new image should have improved stability and use your CPU cycles properly.

We have also included a new feature, so that you can now see information about the job, etc. on the consoles. (Similar to test4theory/vLHC)
You can open the VM console by clicking on the CMS-dev job in your BOINC Manager and then on the "show VM Console" button on the left. An RDP client should open automatically and connect to the VM.
Once there you can look through the different consoles. They are as follows:
1: Job output stdout [white]
2: JobAgent stdout [white]
3: top
4: Job output stderr [red]
5: JobAgent stderr [red]

On Windows you can use Ctrl-Alt-F[n] to jump to the consoles; on Linux you should try it with Alt-F[n].
The output is still a bit messy, but please bear with us.
A graphic version as in t4t is coming soon -- to a VM near you :D

We have also just managed to run the VM on a Microsoft Surface, which is probably the first time ever that a CMS job was run on a Surface :D

2015-02-18: Restriction of new account creation

Unfortunately, there has been a spate of rather dodgy accounts being created in the last few days (try browsing profiles at random...), so we have had to limit new accounts for the time being.
The limiting mechanism is by Invitation Codes.
If you would still like to join this nascent project, then please send an e-mail to Ivan.Reid@CERN.ch with the Subject: "CMS@Home Request" and we will consider your request. Obviously we will consider such factors as whether you are already contributing to BOINC projects before sending an invitation code. Note that a decision may not be immediate.

2015-02-13: Welcome to the CMS development project

Welcome. We've opened up the forums in case anyone wants to contribute.
We're still under development so don't waste too many cycles on it yet -- only run one or two jobs at a time. Let us know of any problems.
I believe that there are incompatibilities on Windows with VirtualBox versions beyond 4.3.12 (at least that's what vLHC@Home has found), but I seem to be able to run with 4.3.20 on Linux. I don't have a Mac box myself (yet...) so I have no experience there.